Preface


Most of the observable phenomena in the empirical sciences are of a multivariate nature. 
In financial studies, assets in stock markets are observed simultaneously and their joint 
development is analyzed to better understand general tendencies and to track indices. In 
medicine recorded observations of subjects in different locations are the basis of reliable 
diagnoses and medication. In quantitative marketing consumer preferences are collected in 
order to construct models of consumer behavior. The underlying theoretical structure of 
these and many other quantitative studies of applied sciences is multivariate. This book 
on Applied Multivariate Statistical Analysis presents the tools and concepts of multivariate 
data analysis with a strong focus on applications. 


The aim of the book is to present multivariate data analysis in a way that is understandable 
for non-mathematicians and practitioners who are confronted by statistical data analysis. 
This is achieved by focusing on the practical relevance and through the e-book character of 
this text. All practical examples may be recalculated and modified by the reader using a 
standard web browser and without reference or application of any specific software. 


The book is divided into three main parts. The first part is devoted to graphical techniques 
describing the distributions of the variables involved. The second part deals with multivariate 
random variables and presents from a theoretical point of view distributions, estimators 
and tests for various practical situations. The last part is on multivariate techniques and 
introduces the reader to the wide selection of tools available for multivariate data analysis. 
All data sets are given in the appendix and are downloadable from www.md-stat.com. The 
text contains a wide variety of exercises the solutions of which are given in a separate 
textbook. In addition, a full set of transparencies is provided on www.md-stat.com, making it
easier for an instructor to present the materials in this book. All transparencies contain
hyperlinks to the statistical web service so that students and instructors alike may recompute all
examples via a standard web browser.


The first section on descriptive techniques is on the construction of the boxplot. Here the 
standard data sets on genuine and counterfeit bank notes and on the Boston housing data are 
introduced. Flury faces are shown in Section 1.5, followed by the presentation of Andrews 
curves and parallel coordinate plots. Histograms, kernel densities and scatterplots complete 
the first part of the book. The reader is introduced to the concept of skewness and correlation 
from a graphical point of view. 

At the beginning of the second part of the book the reader goes on a short excursion into
matrix algebra. Covariances, correlation and the linear model are introduced. This section 
is followed by the presentation of the ANOVA technique and its application to the multiple 
linear model. In Chapter 4 the multivariate distributions are introduced and thereafter 
specialized to the multinormal. The theory of estimation and testing ends the discussion on 
multivariate random variables. 


The third and last part of this book starts with a geometric decomposition of data matrices. 
It is influenced by the French school of analyse de données. This geometric point of view 
is linked to principal components analysis in Chapter 9. An important discussion on factor 
analysis follows with a variety of examples from psychology and economics. The section on 
cluster analysis deals with the various cluster techniques and leads naturally to the problem 
of discrimination analysis. The next chapter deals with the detection of correspondence 
between factors. The joint structure of data sets is presented in the chapter on canonical 
correlation analysis and a practical study on prices and safety features of automobiles is 
given. Next the important topic of multidimensional scaling is introduced, followed by the 
tool of conjoint measurement analysis. The conjoint measurement analysis is often used 
in psychology and marketing in order to measure preference orderings for certain goods. 
The applications in finance (Chapter 17) are numerous. We present here the CAPM model 
and discuss efficient portfolio allocations. The book closes with a presentation on highly 
interactive, computationally intensive techniques. 


This book is designed for the advanced bachelor and first year graduate student as well as 
for the inexperienced data analyst who would like a tour of the various statistical tools in 
a multivariate data analysis workshop. The experienced reader with a sound knowledge of
algebra will certainly skip some sections of the multivariate random variables part but will
hopefully enjoy the various mathematical roots of the multivariate techniques. A graduate
student might think that the first part on descriptive techniques is already well known from
training in introductory statistics. The mathematical and the applied parts of the book (II,
III) will certainly introduce such a reader to the rich realm of multivariate statistical data
analysis modules.


The inexperienced computer user of this e-book is slowly introduced to an interdisciplinary 
way of statistical thinking and will certainly enjoy the various practical examples. This 
e-book is designed as an interactive document with various links to other features. The 
complete e-book may be downloaded from www.xplore-stat.de using the license key given 
on the last page of this book. Our e-book design offers a complete PDF and HTML file with 
links to MD*Tech computing servers. 


The reader of this book may therefore use all the presented methods and data via the local 
XploRe Quantlet Server (XQS) without downloading or buying additional software. Such 
XQ Servers may also be installed in a department or addressed freely on the web (see
www.i-xplore.de for more information).

A book of this kind would not have been possible without the help of many friends,
colleagues and students. For the technical production of the e-book we would like to thank
Jörg Feuerhake, Zdeněk Hlávka, Torsten Kleinow, Sigbert Klinke, Heiko Lehmann and Marlene
Müller. The book has been carefully read by Christian Hafner, Mia Huber, Stefan Sperlich and
Axel Werwatz. We would also like to thank Pavel Čížek, Isabelle De Macq, Holger Gerhardt,
Alena Myšičková and Manh Cuong Vu for the solutions to various statistical problems and
exercises. We thank Clemens Heine from Springer Verlag for continuous support and valuable
suggestions on the style of writing and on the contents covered.


W. Härdle and L. Simar
Berlin and Louvain-la-Neuve, August 2003 

Descriptive Techniques


1 Comparison of Batches 


Multivariate statistical analysis is concerned with analyzing and understanding data in high 
dimensions. We suppose that we are given a set $\{x_i\}_{i=1}^n$ of $n$ observations of a variable vector
$X$ in $\mathbb{R}^p$. That is, we suppose that each observation $x_i$ has $p$ dimensions:


$$x_i = (x_{i1}, x_{i2}, \ldots, x_{ip}),$$


and that it is an observed value of a variable vector $X \in \mathbb{R}^p$. Therefore, $X$ is composed of $p$
random variables:


$$X = (X_1, X_2, \ldots, X_p)$$


where $X_j$, for $j = 1, \ldots, p$, is a one-dimensional random variable. How do we begin to
analyze this kind of data? Before we investigate questions on what inferences we can reach 
from the data, we should think about how to look at the data. This involves descriptive 
techniques. Questions that we could answer by descriptive techniques are: 


• Are there components of $X$ that are more spread out than others?
• Are there some elements of $X$ that indicate subgroups of the data?
• Are there outliers in the components of $X$?
• How “normal” is the distribution of the data?
• Are there “low-dimensional” linear combinations of $X$ that show “non-normal” behavior?


One difficulty of descriptive methods for high dimensional data is the human perceptional
system. Point clouds in two dimensions are easy to understand and to interpret. With
modern interactive computing techniques we can view real-time 3D rotations
and thus also perceive three-dimensional data. A “sliding technique” as described in
Härdle and Scott (1992) may give insight into four-dimensional structures by presenting
dynamic 3D density contours as the fourth variable is changed over its range.


A qualitative jump in presentation difficulties occurs for dimensions greater than or equal to
5, unless the high-dimensional structure can be mapped into lower-dimensional components
(Klinke and Polzehl, 1995). Features like clustered subgroups or outliers, however, can be
detected using a purely graphical analysis.


In this chapter, we investigate the basic descriptive and graphical techniques allowing simple 
exploratory data analysis. We begin the exploration of a data set using boxplots. A boxplot 
is a simple univariate device that detects outliers component by component and that can 
compare distributions of the data among different groups. Next, several multivariate
techniques are introduced (Flury faces, Andrews’ curves and parallel coordinate plots) which
provide graphical displays addressing the questions formulated above. The advantages and 
the disadvantages of each of these techniques are stressed. 


Two basic techniques for estimating densities are also presented: histograms and kernel 
densities. A density estimate gives a quick insight into the shape of the distribution of 
the data. We show that kernel density estimates overcome some of the drawbacks of the 
histograms. 


Finally, scatterplots are shown to be very useful for plotting bivariate or trivariate variables 
against each other: they help to understand the nature of the relationship among variables 
in a data set and allow us to detect groups or clusters of points. Draftman plots or matrix plots
are the visualization of several bivariate scatterplots on the same display. They help detect 
structures in conditional dependences by brushing across the plots. 


1.1 Boxplots 


EXAMPLE 1.1 The Swiss bank data (see Appendix, Table B.2) consists of 200 measurements
on Swiss bank notes. The first half of these measurements are from genuine bank
notes, the other half are from counterfeit bank notes.


The authorities have measured, as indicated in Figure 1.1, 
$X_1$ = length of the bill
$X_2$ = height of the bill (left)
$X_3$ = height of the bill (right)
$X_4$ = distance of the inner frame to the lower border
$X_5$ = distance of the inner frame to the upper border
$X_6$ = length of the diagonal of the central picture.


These data are taken from Flury and Riedwyl (1988). The aim is to study how these
measurements may be used in determining whether a bill is genuine or counterfeit.


Figure 1.1. An old Swiss 1000-franc bank note.


The boxplot is a graphical technique that displays the distribution of variables. It helps us 
see the location, skewness, spread, tail length and outlying points. 


It is particularly useful in comparing different batches. The boxplot is a graphical
representation of the Five Number Summary. To introduce the Five Number Summary, let us
consider for a moment a smaller, one-dimensional data set: the population of the 15 largest
U.S. cities in 1960 (Table 1.1).


In the Five Number Summary, we calculate the upper quartile $F_U$, the lower quartile $F_L$,
the median and the extremes. Recall that order statistics $\{x_{(1)}, x_{(2)}, \ldots, x_{(n)}\}$ are a set of
ordered values $x_1, x_2, \ldots, x_n$ where $x_{(1)}$ denotes the minimum and $x_{(n)}$ the maximum. The
median $M$ typically cuts the set of observations in two equal parts, and is defined as


$$M = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & n \text{ odd} \\ \frac{1}{2}\left\{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right\} & n \text{ even.} \end{cases} \tag{1.1}$$


The quartiles cut the set into four equal parts, which are often called fourths (that is why we
use the letter $F$). Using a definition that goes back to Hoaglin, Mosteller and Tukey (1983)
the definition of a median can be generalized to fourths, eighths, etc. Considering the order
statistics we can define the depth of a data value $x_{(i)}$ as $\min\{i, n - i + 1\}$. If $n$ is odd, the
depth of the median is $\frac{n+1}{2}$. If $n$ is even, $\frac{n+1}{2}$ is a fraction. Thus, the median is determined
to be the average between the two data values belonging to the next larger and smaller order
statistics, i.e., $M = \frac{1}{2}\left\{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right\}$. In our example, we have $n = 15$, hence the median
$M = x_{(8)} = 88$.

City               Pop. (10,000)    Order Statistics
New York                778         $x_{(15)}$
Chicago                 355         $x_{(14)}$
Los Angeles             248         $x_{(13)}$
Philadelphia            200         $x_{(12)}$
Detroit                 167         $x_{(11)}$
Baltimore                94         $x_{(10)}$
Houston                  94         $x_{(9)}$
Cleveland                88         $x_{(8)}$
Washington D.C.          76         $x_{(7)}$
Saint Louis              75         $x_{(6)}$
Milwaukee                74         $x_{(5)}$
San Francisco            74         $x_{(4)}$
Boston                   70         $x_{(3)}$
Dallas                   68         $x_{(2)}$
New Orleans              63         $x_{(1)}$


Table 1.1. The 15 largest U.S. cities in 1960. 


We proceed in the same way to get the fourths. Take the depth of the median and calculate


$$\text{depth of fourth} = \frac{[\text{depth of median}] + 1}{2}$$


with $[z]$ denoting the largest integer smaller than or equal to $z$. In our example this gives
4.5 and thus leads to the two fourths


$$F_L = \frac{1}{2}\left\{x_{(4)} + x_{(5)}\right\}$$

$$F_U = \frac{1}{2}\left\{x_{(11)} + x_{(12)}\right\}$$


(recalling that a depth which is a fraction corresponds to the average of the two nearest data
values).


The $F$-spread, $d_F$, is defined as $d_F = F_U - F_L$. The outside bars


$$F_U + 1.5\,d_F \tag{1.2}$$

$$F_L - 1.5\,d_F \tag{1.3}$$


are the borders beyond which a point is regarded as an outlier. For the number of points
outside these bars see Exercise 1.3. For the $n = 15$ data points the fourths are
$74 = \frac{1}{2}\{x_{(4)} + x_{(5)}\}$ and $183.5 = \frac{1}{2}\{x_{(11)} + x_{(12)}\}$. Therefore the $F$-spread and the upper and
lower outside bars in the above example are calculated as follows:


$$d_F = F_U - F_L = 183.5 - 74 = 109.5 \tag{1.4}$$

$$F_L - 1.5\,d_F = 74 - 1.5 \cdot 109.5 = -90.25 \tag{1.5}$$

$$F_U + 1.5\,d_F = 183.5 + 1.5 \cdot 109.5 = 347.75. \tag{1.6}$$


#    15       U.S. Cities
M    8            88
F    4.5      74      183.5
     1        63      778


Table 1.2. Five number summary.


Since New York and Chicago are beyond the outside bars they are considered to be outliers. 
The minimum and the maximum are called the extremes. The mean is defined as 


$$\bar{x} = n^{-1} \sum_{i=1}^{n} x_i,$$


which is 168.27 in our example. The mean is a measure of location. The median (88), the 
fourths (74;183.5) and the extremes (63;778) constitute basic information about the data. 
The combination of these five numbers leads to the Five Number Summary as displayed in 
Table 1.2. The depths of each of the five numbers have been added as an additional column. 
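
The depth rule can be checked with a short computation. The following is a minimal sketch in Python, not part of the original XploRe material; the function names are hypothetical. The Saint Louis value is taken as 75, which is the value consistent with the mean of 168.27 stated above.

```python
import math


def letter_value_depths(n):
    """Depths of the median and of the fourths for a batch of size n."""
    depth_median = (n + 1) / 2
    depth_fourth = (math.floor(depth_median) + 1) / 2
    return depth_median, depth_fourth


def value_at_depth(sorted_values, depth, from_top=False):
    """Value at a possibly fractional depth.

    A depth ending in .5 is the average of the two neighbouring order
    statistics; an integer depth returns that order statistic itself.
    """
    n = len(sorted_values)
    lo, hi = math.floor(depth), math.ceil(depth)
    if from_top:
        # depth d counted from the maximum is order statistic n - d + 1
        lo, hi = n - hi + 1, n - lo + 1
    return 0.5 * (sorted_values[lo - 1] + sorted_values[hi - 1])


def five_number_summary(values):
    """Extremes, fourths and median of a batch, using the depth rule."""
    xs = sorted(values)
    depth_median, depth_fourth = letter_value_depths(len(xs))
    return {
        "min": xs[0],
        "F_L": value_at_depth(xs, depth_fourth),
        "M": value_at_depth(xs, depth_median),
        "F_U": value_at_depth(xs, depth_fourth, from_top=True),
        "max": xs[-1],
    }


# Populations (in 10,000) of the 15 largest U.S. cities in 1960 (Table 1.1);
# Saint Louis = 75 is an assumption consistent with the stated mean.
cities = [778, 355, 248, 200, 167, 94, 94, 88, 76, 75, 74, 74, 70, 68, 63]
summary = five_number_summary(cities)
mean = sum(cities) / len(cities)
print(summary, round(mean, 2))
```

Statistical libraries often use other quantile conventions, so their quartiles need not coincide exactly with the fourths computed by this rule.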


Construction of the Boxplot

1. Draw a box with borders (edges) at $F_L$ and $F_U$ (i.e., 50% of the data are in this box).

2. Draw the median as a solid line and the mean as a dotted line.

3. Draw “whiskers” from each end of the box to the most remote point that is NOT an
outlier.

4. Show outliers as either “⋆” or “•” depending on whether they are outside the bars
$F_U + 1.5\,d_F$, $F_L - 1.5\,d_F$ or the bars $F_U + 3\,d_F$, $F_L - 3\,d_F$ respectively. Label them if possible.


Figure 1.2. Boxplot for U.S. cities. MVAboxcity.xpl


In the U.S. cities example the cutoff points (outside bars) are at −90.25 and 347.75, see (1.5)
and (1.6), hence we draw whiskers to New Orleans and Los Angeles. We can see from Figure 1.2
that the data are very skewed: the upper half of the data (above the median) is more spread
out than the lower half (below the median). The data contain two outliers marked as a star
and a circle. The more distinct outlier is shown as a star. The mean (as a non-robust measure
of location) is pulled away from the median.
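
The construction steps can be traced numerically for the U.S. cities example. This is a minimal sketch in Python under stated assumptions: the function name and the labels "mild" and "far" are hypothetical, the fourths 74 and 183.5 are the values derived in the text, and Saint Louis is taken as 75.

```python
def boxplot_elements(values, f_l, f_u):
    """Outside bars, whisker ends and outliers for given fourths f_l, f_u."""
    d_f = f_u - f_l                      # F-spread
    lower_bar = f_l - 1.5 * d_f          # equation (1.3)
    upper_bar = f_u + 1.5 * d_f          # equation (1.2)
    lower_far = f_l - 3.0 * d_f          # bars at three F-spreads
    upper_far = f_u + 3.0 * d_f

    # points inside the outside bars determine the whisker ends
    inside = [v for v in values if lower_bar <= v <= upper_bar]
    # "mild" outliers lie beyond 1.5 d_F but within 3 d_F of the box
    mild = [v for v in values
            if (v < lower_bar or v > upper_bar)
            and lower_far <= v <= upper_far]
    # "far" outliers lie beyond 3 d_F
    far = [v for v in values if v < lower_far or v > upper_far]

    return {
        "d_F": d_f,
        "outside_bars": (lower_bar, upper_bar),
        "whiskers": (min(inside), max(inside)),
        "mild_outliers": sorted(mild),
        "far_outliers": sorted(far),
    }


cities = [778, 355, 248, 200, 167, 94, 94, 88, 76, 75, 74, 74, 70, 68, 63]
elements = boxplot_elements(cities, f_l=74, f_u=183.5)
print(elements)
```

With these inputs the whiskers end at 63 (New Orleans) and 248 (Los Angeles), Chicago (355) lies between the 1.5 and 3 F-spread bars, and New York (778) lies beyond the 3 F-spread bar.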


Boxplots are very useful tools in comparing batches. The relative location of the distribution 
of different batches tells us a lot about the batches themselves. Before we come back to the 
Swiss bank data let us compare the fuel economy of vehicles from different countries, see 
Figure 1.3 and Table B.3. 


The data are from the second column of Table B.3 and show the mileage (miles per gallon)
of U.S. American, Japanese and European cars. The five-number summaries for these data
sets are {12, 16.8, 18.8, 22, 30}, {18, 22, 25, 30.5, 35}, and {14, 19, 23, 25, 28} for American,
Japanese, and European cars, respectively. This reflects the information shown in Figure 1.3.


Figure 1.3. Boxplot for the mileage of American, Japanese and European
cars (from left to right). MVAboxcar.xpl


The following conclusions can be made: 


• Japanese cars achieve higher fuel efficiency than U.S. and European cars.
• There is one outlier, a very fuel-efficient car (VW-Rabbit Diesel).
• The main body of the U.S. car data (the box) lies below the Japanese car data.
• The worst Japanese car is more fuel-efficient than almost 50 percent of the U.S. cars.
• The spreads of the Japanese and the U.S. cars are almost equal.
• The median of the Japanese data is above that of the European data and the U.S.
data.


Now let us apply the boxplot technique to the bank data set. In Figure 1.4 we show
the parallel boxplot of the diagonal variable $X_6$. On the left is the value of the genuine
bank notes and on the right the value of the counterfeit bank notes. The two five-number
summaries are {140.65, 141.25, 141.5, 141.8, 142.4} for the genuine bank notes, and
{138.3, 139.2, 139.5, 139.8, 140.65} for the counterfeit ones.


Figure 1.4. The $X_6$ variable of Swiss bank data (diagonal of bank notes).
MVAboxbank6.xpl


One sees that the diagonals of the genuine bank notes tend to be larger. It is harder to see 
a clear distinction when comparing the length of the bank notes $X_1$, see Figure 1.5. There
are a few outliers in both plots. Almost all the observations of the diagonal of the genuine 
notes are above the ones from the counterfeit. There is one observation in Figure 1.4 of the 
genuine notes that is almost equal to the median of the counterfeit notes. Can the parallel 
boxplot technique help us distinguish between the two types of bank notes? 


Figure 1.5. The $X_1$ variable of Swiss bank data (length of bank notes).
MVAboxbank1.xpl


Summary

↪ The median and mean bars are measures of location.

↪ The relative location of the median (and the mean) in the box is a measure
of skewness.

↪ The length of the box and whiskers is a measure of spread.

↪ The length of the whiskers indicates the tail length of the distribution.

↪ The outlying points are indicated with a “⋆” or “•” depending on whether they
are outside the bars $F_U + 1.5\,d_F$, $F_L - 1.5\,d_F$ or the bars $F_U + 3\,d_F$, $F_L - 3\,d_F$ respectively.

↪ The boxplots do not indicate multimodality or clusters.

↪ If we compare the relative size and location of the boxes, we are comparing
distributions.


1.2 Histograms 


Histograms are density estimates. A density estimate gives a good impression of the
distribution of the data. In contrast to boxplots, density estimates show possible multimodality
of the data. The idea is to locally represent the data density by counting the number of
observations in a sequence of consecutive intervals (bins) with origin $x_0$. Let $B_j(x_0, h)$ denote
the bin of length $h$ which is the element of a bin grid starting at $x_0$:


$$B_j(x_0, h) = [x_0 + (j - 1)h,\; x_0 + jh), \qquad j \in \mathbb{Z},$$


where $[\cdot, \cdot)$ denotes a left closed and right open interval. If $\{x_i\}_{i=1}^n$ is an i.i.d. sample with
density $f$, the histogram is defined as follows:


$$\hat{f}_h(x) = n^{-1} h^{-1} \sum_{j \in \mathbb{Z}} \sum_{i=1}^{n} I\{x_i \in B_j(x_0, h)\}\, I\{x \in B_j(x_0, h)\}. \tag{1.7}$$


In sum (1.7) the first indicator function $I\{x_i \in B_j(x_0, h)\}$ (see Symbols & Notation in
Appendix A) counts the number of observations falling into bin $B_j(x_0, h)$. The second
indicator function is responsible for “localizing” the counts around $x$. The parameter $h$ is a
smoothing or localizing parameter and controls the width of the histogram bins. An h that 
is too large leads to very big blocks and thus to a very unstructured histogram. On the other 
hand, an h that is too small gives a very variable estimate with many unimportant peaks. 
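
Definition (1.7) translates almost directly into code. The following is a minimal sketch in Python; the function names and the small sample are hypothetical and serve only to illustrate the formula.

```python
import math


def bin_index(value, x0, h):
    """Index j of the bin B_j(x0, h) = [x0 + (j-1)h, x0 + jh) containing value."""
    return math.floor((value - x0) / h) + 1


def histogram_density(x, sample, x0, h):
    """Histogram estimate (1.7) at the point x.

    Only the bin containing x contributes to the double sum, so it is
    enough to count the observations that share the bin of x.
    """
    j = bin_index(x, x0, h)
    count = sum(1 for xi in sample if bin_index(xi, x0, h) == j)
    return count / (len(sample) * h)


# Hypothetical toy sample, origin and binwidth (not data from the text).
sample = [0.1, 0.2, 0.25, 0.7, 1.4]
x0, h = 0.0, 0.5
heights = [histogram_density(x, sample, x0, h) for x in (0.3, 0.6, 1.2)]
print(heights)
```

Each bar has area equal to its height times $h$, so the bar areas of all occupied bins add up to one.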


The effect of $h$ is given in detail in Figure 1.6. It contains the histogram (upper left) for the
diagonal of the counterfeit bank notes for $x_0 = 137.8$ (the minimum of these observations)
and $h = 0.1$. Increasing $h$ to $h = 0.2$ and using the same origin, $x_0 = 137.8$, results in
the histogram shown in the lower left of the figure. This density histogram is somewhat
smoother due to the larger $h$. The binwidth is next set to $h = 0.3$ (upper right). From this
histogram, one has the impression that the distribution of the diagonal is bimodal with peaks
at about 138.5 and 139.9. The detection of modes requires a fine tuning of the binwidth.
Using methods from smoothing methodology (Härdle, Müller, Sperlich and Werwatz, 2003)
one can find an “optimal” binwidth $h$ for $n$ observations:


$$h_{opt} = \left(\frac{24\sqrt{\pi}}{n}\right)^{1/3}.$$


Unfortunately, the binwidth $h$ is not the only parameter determining the shape of $\hat{f}$.


Figure 1.6. Diagonal of counterfeit bank notes. Histograms with $x_0 = 137.8$
and $h = 0.1$ (upper left), $h = 0.2$ (lower left), $h = 0.3$ (upper right),
$h = 0.4$ (lower right). MVAhisbank1.xpl


In Figure 1.7, we show histograms with $x_0 = 137.65$ (upper left), $x_0 = 137.75$ (lower left),
$x_0 = 137.85$ (upper right), and $x_0 = 137.95$ (lower right). All the graphs have been
scaled equally on the y-axis to allow comparison. One sees that, despite the fixed binwidth
$h$, the interpretation is not facilitated. The shift of the origin $x_0$ (to 4 different locations)
created 4 different histograms. This property of histograms strongly contradicts the goal
of presenting data features. Obviously, the same data are represented quite differently by
the 4 histograms. A remedy has been proposed by Scott (1985): “Average the shifted
histograms!”. The result is presented in Figure 1.8. Here all bank note observations (genuine
and counterfeit) have been used. The averaged shifted histogram is no longer dependent on
the origin and shows a clear bimodality of the diagonals of the Swiss bank notes.
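
The averaging idea can be sketched as follows. This is an illustrative Python sketch, not the XploRe quantlet used for Figure 1.8: it assumes that the shifted histograms share the binwidth $h$ and use the equally spaced origins $x_0 + kh/m$ for $k = 0, \ldots, m-1$. The function names and the toy sample are hypothetical.

```python
import math


def histogram_density(x, sample, x0, h):
    """Histogram estimate at x for the bin grid with origin x0 and binwidth h."""
    j = math.floor((x - x0) / h)
    count = sum(1 for xi in sample if math.floor((xi - x0) / h) == j)
    return count / (len(sample) * h)


def ash_density(x, sample, x0, h, shifts):
    """Average of `shifts` histograms whose origins are x0 + k*h/shifts."""
    total = sum(
        histogram_density(x, sample, x0 + k * h / shifts, h)
        for k in range(shifts)
    )
    return total / shifts


# Hypothetical toy sample (not data from the text).
sample = [0.1, 0.2, 0.25, 0.7, 1.4]
value = ash_density(0.3, sample, x0=0.0, h=0.5, shifts=4)
shifted = ash_density(0.3, sample, x0=0.125, h=0.5, shifts=4)
print(value, shifted)
```

Moving the origin by one shift step $h/m$ only permutes the set of grids being averaged, which is why the two printed values agree. As the number of shifts grows, the estimate depends less and less on the particular origin.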


Figure 1.7. Diagonal of counterfeit bank notes. Histogram with $h = 0.4$
and origins $x_0 = 137.65$ (upper left), $x_0 = 137.75$ (lower left), $x_0 = 137.85$
(upper right), $x_0 = 137.95$ (lower right). MVAhisbank2.xpl


Summary

↪ Modes of the density are detected with a histogram.

↪ Modes correspond to strong peaks in the histogram.

↪ Histograms with the same $h$ need not be identical. They also depend on
the origin $x_0$ of the grid.

↪ The influence of the origin $x_0$ is drastic. Changing $x_0$ creates different
looking histograms.

↪ The consequence of an $h$ that is too large is an unstructured histogram
that is too flat.

↪ A binwidth $h$ that is too small results in an unstable histogram.

↪ There is an “optimal” $h = (24\sqrt{\pi}/n)^{1/3}$.

↪ It is recommended to use averaged histograms. They are kernel densities.


1.3 Kernel Densities


The major difficulties of histogram estimation may be summarized in four critiques:


• determination of the binwidth $h$, which controls the shape of the histogram,
• choice of the bin origin $x_0$, which also influences to some extent the shape,
• loss of information since observations are replaced by the central point of the interval
in which they fall,
• the underlying density function is often assumed to be smooth, but the histogram is
not smooth.


Rosenblatt (1956), Whittle (1958), and Parzen (1962) developed an approach which avoids 
the last three difficulties. First, a smooth kernel function rather than a box is used as the 
basic building block. Second, the smooth function is centered directly over each observation. 
Let us study this refinement by supposing that $x$ is the center value of a bin. The histogram
can in fact be rewritten as


$$\hat{f}_h(x) = n^{-1} h^{-1} \sum_{i=1}^{n} I\left(|x - x_i| \le \frac{h}{2}\right). \tag{1.8}$$


If we define $K(u) = I(|u| \le \frac{1}{2})$, then (1.8) changes to


$$\hat{f}_h(x) = n^{-1} h^{-1} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right). \tag{1.9}$$


This is the general form of the kernel estimator. Allowing smoother kernel functions like the
quartic kernel,


$$K(u) = \frac{15}{16}(1 - u^2)^2\, I(|u| \le 1),$$


and computing $x$ not only at bin centers gives us the kernel density estimator. Kernel
estimators can also be derived via weighted averaging of rounded points (WARPing) or by
averaging histograms with different origins, see Scott (1985). Table 1.5 introduces some
commonly used kernels.
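
The step from the histogram form (1.8) to the kernel estimator (1.9) can be made concrete in a few lines. The following is a minimal Python sketch; the function names and the toy sample are hypothetical and are not taken from the text.

```python
def box_kernel(u):
    """K(u) = I(|u| <= 1/2), which turns (1.9) back into (1.8)."""
    return 1.0 if abs(u) <= 0.5 else 0.0


def quartic_kernel(u):
    """Quartic (biweight) kernel K(u) = 15/16 (1 - u^2)^2 I(|u| <= 1)."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def kernel_density(x, sample, h, kernel=quartic_kernel):
    """Kernel estimator (1.9) at the point x with bandwidth h."""
    return sum(kernel((x - xi) / h) for xi in sample) / (len(sample) * h)


# Hypothetical toy sample (not data from the text).
sample = [0.1, 0.2, 0.25, 0.7, 1.4]

# At a bin center the box kernel reproduces the histogram height:
# three of the five points lie within h/2 = 0.25 of x = 0.25.
box_value = kernel_density(0.25, sample, h=0.5, kernel=box_kernel)

# The quartic kernel gives a smooth estimate at any point x.
smooth_value = kernel_density(0.25, sample, h=0.5)
print(box_value, smooth_value)
```

Because the kernel is passed as an argument, any other kernel can be used without changing the estimator itself.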


Figure 1.8. Averaged shifted histograms based on all (counterfeit and genuine)
Swiss bank notes: there are 2 shifts (upper left), 4 shifts (lower left),
8 shifts (upper right), and 16 shifts (lower right). MVAashbank.xpl


$K(\bullet)$                                                          Kernel

$K(u) = \frac{1}{2} I(|u| \le 1)$                                     Uniform
$K(u) = (1 - |u|) I(|u| \le 1)$                                       Triangle
$K(u) = \frac{3}{4}(1 - u^2) I(|u| \le 1)$                            Epanechnikov
$K(u) = \frac{15}{16}(1 - u^2)^2 I(|u| \le 1)$                        Quartic (Biweight)
$K(u) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{u^2}{2}) = \varphi(u)$      Gaussian


Table 1.5. Kernel functions.


Different kernels generate different shapes of the estimated density. The most important
parameter is the so-called bandwidth $h$, which can be optimized, for example, by cross-validation;
see Härdle (1991) for details. The cross-validation method minimizes the integrated squared
error. This measure of discrepancy is based on the squared differences $\{\hat{f}_h(x) - f(x)\}^2$.
Averaging these squared deviations over a grid of points $\{x_l\}_{l=1}^{L}$ leads to


$$L^{-1} \sum_{l=1}^{L} \left\{\hat{f}_h(x_l) - f(x_l)\right\}^2.$$


Asymptotically, if this grid size tends to zero, we obtain the integrated squared error:


$$\int \left\{\hat{f}_h(x) - f(x)\right\}^2 dx.$$


In practice, it turns out that the method consists of selecting a bandwidth that minimizes
the cross-validation function


$$\int \hat{f}_h^2 - \frac{2}{n} \sum_{i=1}^{n} \hat{f}_{h,i}(x_i)$$


where $\hat{f}_{h,i}$ is the density estimate obtained by using all datapoints except for the $i$-th
observation. Both terms in the above function involve double sums. Computation may therefore
be slow. There are many other density bandwidth selection methods. Probably the fastest
way to calculate this is to refer to some reasonable reference distribution. The idea of using
the Normal distribution as a reference, for example, goes back to Silverman (1986). The
resulting choice of $h$ is called the rule of thumb.


Figure 1.9. Densities of the diagonals of genuine and counterfeit bank
notes. Automatic density estimates. MVAdenbank.xpl


Figure 1.10. Contours of the density of $X_4$ and $X_6$ of genuine and
counterfeit bank notes. MVAcontbank2.xpl
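
The cross-validation idea can be sketched in code. The sketch below is an illustration under stated assumptions, not the procedure behind Figure 1.9. It uses the least-squares criterion $\int \hat{f}_h^2 - \frac{2}{n}\sum_{i} \hat{f}_{h,i}(x_i)$ and assumes a Gaussian kernel, for which the first term has a closed form because the convolution of two standard normal densities is the $N(0, 2)$ density. The function names, the toy sample and the candidate bandwidths are hypothetical.

```python
import math


def gauss(u):
    """Gaussian kernel, the standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def gauss_conv(u):
    """Convolution of the Gaussian kernel with itself: the N(0, 2) density."""
    return math.exp(-0.25 * u * u) / math.sqrt(4.0 * math.pi)


def cv_score(sample, h):
    """Least-squares cross-validation score for bandwidth h (Gaussian kernel)."""
    n = len(sample)
    integral_sq = 0.0   # double sum giving the integral of fhat squared
    loo_sum = 0.0       # double sum over pairs i != j for the leave-one-out term
    for i, xi in enumerate(sample):
        for j, xj in enumerate(sample):
            u = (xi - xj) / h
            integral_sq += gauss_conv(u)
            if i != j:
                loo_sum += gauss(u)
    integral_sq /= n * n * h
    # (1/n) * sum_i fhat_{h,i}(x_i), each fhat_{h,i} built from n - 1 points
    loo_mean = loo_sum / (n * (n - 1) * h)
    return integral_sq - 2.0 * loo_mean


def select_bandwidth(sample, candidates):
    """Candidate bandwidth with the smallest cross-validation score."""
    return min(candidates, key=lambda h: cv_score(sample, h))


# Hypothetical toy sample and candidate grid (not data from the text).
sample = [0.0, 0.4, 0.5, 1.1, 1.3, 2.5, 2.6, 3.0]
candidates = [0.1 * k for k in range(1, 21)]
h_cv = select_bandwidth(sample, candidates)
print(h_cv, cv_score(sample, h_cv))
```

Both terms are computed with double loops over the sample, so the cost of one evaluation grows quadratically with $n$. This matches the remark above that the computation may be slow.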


For the Gaussian kernel from Table 1.5 and a Normal reference distribution, the rule of
thumb is to choose

$$h_G = 1.06\, \hat{\sigma}\, n^{-1/5} \tag{1.10}$$


where $\hat{\sigma} = \sqrt{n^{-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$ denotes the sample standard deviation. This choice of $h_G$
optimizes the integrated squared distance between the estimator and the true density. For
the quartic kernel, we need to transform (1.10). The modified rule of thumb is:


$$h_Q = 2.62 \cdot h_G. \tag{1.11}$$
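
The two rules of thumb are simple enough to compute by hand, but a short sketch makes the ingredients explicit. This is an illustrative Python sketch with hypothetical function names and a toy sample; it uses the sample standard deviation with divisor $n$, as in the definition of $\hat{\sigma}$ above.

```python
import math


def rule_of_thumb_gaussian(sample):
    """Rule-of-thumb bandwidth (1.10) for the Gaussian kernel."""
    n = len(sample)
    mean = sum(sample) / n
    # sample standard deviation with divisor n, as defined in the text
    sigma = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)
    return 1.06 * sigma * n ** (-1.0 / 5.0)


def rule_of_thumb_quartic(sample):
    """Modified rule of thumb (1.11) for the quartic kernel."""
    return 2.62 * rule_of_thumb_gaussian(sample)


# Hypothetical toy sample (not data from the text).
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
h_g = rule_of_thumb_gaussian(sample)
h_q = rule_of_thumb_quartic(sample)
print(h_g, h_q)
```

The quartic bandwidth is larger because the quartic kernel has compact support on $[-1, 1]$ and needs a wider window to achieve a comparable amount of smoothing.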


Figure 1.9 shows the automatic density estimates for the diagonals of the counterfeit and
genuine bank notes. The density on the left is the density corresponding to the diagonal
of the counterfeit data. The separation is clearly visible, but there is also an overlap. The
problem of distinguishing between the counterfeit and genuine bank notes is not solved by
just looking at the diagonals of the notes! The question arises whether a better separation
could be achieved using not only the diagonals but one or two more variables of the data
set. The estimation of higher dimensional densities is analogous to that of one-dimensional.
We show a two dimensional density estimate for $X_4$ and $X_5$ in Figure 1.10. The contour
lines indicate the height of the density. One sees two separate distributions in this higher
dimensional space, but they still overlap to some extent.


Figure 1.11. Contours of the density of $X_4$, $X_5$, $X_6$ of genuine and
counterfeit bank notes. MVAcontbank3.xpl


We can add one more dimension and give a graphical representation of a three dimensional
density estimate, or more precisely an estimate of the joint distribution of $X_4$, $X_5$ and $X_6$.
Figure 1.11 shows the contour areas at 3 different levels of the density: 0.2 (light grey), 0.4
(grey), and 0.6 (black) of this three dimensional density estimate. One can clearly recognize
two “ellipsoids” (at each level), but as before, they overlap. In Chapter 12 we will learn
how to separate the two ellipsoids and how to develop a discrimination rule to distinguish
between these data points.


Summary

↪ Kernel densities estimate distribution densities by the kernel method.

↪ The bandwidth $h$ determines the degree of smoothness of the estimate $\hat{f}$.

↪ Kernel densities are smooth functions and they can graphically represent
distributions (up to 3 dimensions).

↪ A simple (but not necessarily correct) way to find a good bandwidth is to
compute the rule of thumb bandwidth $h_G = 1.06\, \hat{\sigma}\, n^{-1/5}$. This bandwidth
is to be used only in combination with a Gaussian kernel $\varphi$.

↪ Kernel density estimates are a good descriptive tool for seeing modes,
location, skewness, tails, asymmetry, etc.


1.4 Scatterplots 


Scatterplots are bivariate or trivariate plots of variables against each other. They help us 
understand relationships among the variables of a data set. A downward-sloping scatter 
indicates that as we increase the variable on the horizontal axis, the variable on the vertical 
axis decreases. An analogous statement can be made for upward-sloping scatters. 


Figure 1.12 plots the 5th column (upper inner frame) of the bank data against the 6th 
column (diagonal). The scatter is downward-sloping. As we already know from the previous 
section on marginal comparison (e.g., Figure 1.9) a good separation between genuine and 
counterfeit bank notes is visible for the diagonal variable. The sub-cloud in the upper half 
(circles) of Figure 1.12 corresponds to the genuine bank notes. As noted before, this separation
is not distinct, since the two groups overlap somewhat. 


This can be verified in an interactive computing environment by showing the index and 
coordinates of certain points in this scatterplot. In Figure 1.12, the 70th observation in 
the merged data set is given as a thick circle, and it is from a genuine bank note. This 
observation lies well embedded in the cloud of counterfeit bank notes. One straightforward 
approach that could be used to tell the counterfeit from the genuine bank notes is to draw 
a straight line and define notes above this value as genuine. We would of course misclassify 
the 70th observation, but can we do better? 


Figure 1.12. 2D scatterplot for X5 vs. X6 of the bank notes, with axes upper inner
frame (X5) and diagonal (X6). Genuine notes are circles, counterfeit notes are stars.
MVAscabank56.xpl
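
The straight-line rule described above (notes above a fixed diagonal value are called genuine) can be sketched as follows. The diagonal values, labels and the threshold are hypothetical and serve only to show how misclassified observations are found.

```python
def classify_by_line(diagonal, threshold):
    """Call a note genuine if its diagonal lies above the horizontal line."""
    return "genuine" if diagonal > threshold else "counterfeit"

def misclassified(values, labels, threshold):
    """Indices of observations whose predicted label differs from the true one."""
    return [i for i, (v, lab) in enumerate(zip(values, labels))
            if classify_by_line(v, threshold) != lab]

# Hypothetical diagonal lengths and labels (not the actual bank data)
diagonal = [141.0, 141.7, 142.2, 139.6, 139.8, 139.5, 140.2]
labels = ["genuine", "genuine", "genuine", "genuine",
          "counterfeit", "counterfeit", "counterfeit"]

# Hypothetical position of the separating line
errors = misclassified(diagonal, labels, threshold=140.65)
```

In this toy data the fourth note plays the role of the 70th observation: it is genuine but lies inside the counterfeit cloud, so no horizontal line classifies it correctly without misclassifying others.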


If we extend the two-dimensional scatterplot by adding a third variable, e.g., X4 (lower 
distance to inner frame), we obtain the scatterplot in three dimensions as shown in 
Figure 1.13. It becomes apparent from the location of the point clouds that a better separation 
is obtained. We have rotated the three-dimensional data until this satisfactory 3D view 
was obtained. Later, we will see that rotation is the same as bundling a high-dimensional 
observation into one or more linear combinations of the elements of the observation vector. 
In other words, the “separation line” parallel to the horizontal coordinate axis in Figure 1.12 
is in Figure 1.13 a plane and no longer parallel to one of the axes. The formula for such a 
separation plane is a linear combination of the elements of the observation vector: 


$$a_1 x_1 + a_2 x_2 + \ldots + a_6 x_6 = \text{const.} \qquad (1.12)$$


The algorithm that automatically finds the weights $(a_1, \ldots, a_6)$ will be investigated later on 
in Chapter 12. 
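
As a rough sketch of how a rule based on (1.12) could be applied once weights are given, the following Python fragment evaluates the linear combination and reports on which side of the plane an observation falls. The weights and the constant are hypothetical; they are not the weights produced by the algorithm of Chapter 12.

```python
def linear_score(x, a):
    """Linear combination a_1 x_1 + ... + a_p x_p of the observation x."""
    return sum(ai * xi for ai, xi in zip(a, x))

def side_of_plane(x, a, const):
    """Return 1 above the plane, -1 below it, and 0 exactly on it."""
    score = linear_score(x, a)
    if score > const:
        return 1
    if score < const:
        return -1
    return 0

# The 96th bank note as quoted in Example 1.2
x96 = (215.6, 129.9, 129.9, 9.0, 9.5, 141.7)

# Hypothetical weights: only the diagonal counts, i.e. an axis-parallel line
a_axis = (0, 0, 0, 0, 0, 1)
# Hypothetical weights combining X4, X5 and X6
a_mixed = (0, 0, 0, -1, -1, 1)

side_axis = side_of_plane(x96, a_axis, const=140.65)
score_mixed = linear_score(x96, a_mixed)
```

Setting all weights but one to zero recovers a separation line parallel to a coordinate axis, which is the special case used in Figure 1.12.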


Figure 1.13. 3D scatterplot of the bank notes for (X4, X5, X6). Genuine 
notes are circles, counterfeit are stars. MVAscabank456.xpl 


Let us study yet another technique: the scatterplot matrix. If we want to draw all possible 
two-dimensional scatterplots for the variables, we can create a so-called draftman’s plot 
(named after a draftman who prepares drafts for parliamentary discussions). Similar to a 
draftman’s plot the scatterplot matrix helps in creating new ideas and in building knowledge 
about dependencies and structure. 


Figure 1.14 shows a draftman plot applied to the last four columns of the full bank data 
set. For ease of interpretation we have distinguished between the group of counterfeit and 
genuine bank notes by a different color. As discussed several times before, the separability of 
the two types of notes is different for different scatterplots. Not only is it difficult to perform 
this separation on, say, scatterplot X3 vs. X4, in addition the “separation line” is no longer 
parallel to one of the axes. The most obvious separation happens in the scatterplot in the 
lower right where we show, as in Figure 1.12, X5 vs. X6. The separation line here would be 
upward-sloping with an intercept at about X6 = 139. The upper right half of the draftman 
plot shows the density contours that we have introduced in Section 1.3. 


Figure 1.14. Draftman plot of the bank notes. The pictures in the left column 
show (X3, X4), (X3, X5) and (X3, X6), in the middle we have (X4, X5) 
and (X4, X6), and in the lower right is (X5, X6). The upper right half contains 
the corresponding density contour plots. MVAdrafbank4.xpl 


The power of the draftman plot lies in its ability to show the internal connections of the 
scatter diagrams. Define a brush as a re-scalable rectangle that we can move via keyboard 
or mouse over the screen. Inside the brush we can highlight or color observations. Suppose 
the technique is installed in such a way that as we move the brush in one scatter, the 
corresponding observations in the other scatters are also highlighted. By moving the brush, 
we can study conditional dependence. 
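
The linking of the scatters by a brush can be sketched without any graphics: the brush selects observation indices in one panel, and the same indices are then looked up in another panel. The rows below are hypothetical values for (X3, X4, X5, X6), and the helper names are assumptions made for this illustration.

```python
def panel(rows, j, k):
    """Points of the scatterplot of column j against column k."""
    return [(row[j], row[k]) for row in rows]

def brush(points, x_range, y_range):
    """Indices of the points lying inside the rectangular brush."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [i for i, (x, y) in enumerate(points)
            if x_lo <= x <= x_hi and y_lo <= y <= y_hi]

# Hypothetical rows (X3, X4, X5, X6), not the actual bank data
rows = [
    (129.0, 8.1, 9.5, 141.0),
    (129.7, 8.7, 9.6, 141.7),
    (130.1, 10.4, 11.0, 139.8),
    (130.4, 9.9, 11.7, 139.5),
    (129.6, 7.4, 10.9, 141.2),
]

# Brush the upper point cloud in the X5 vs. X6 panel
selected = brush(panel(rows, 2, 3), x_range=(9.0, 11.0), y_range=(140.5, 142.5))
# Highlight the same observations in the X3 vs. X4 panel
linked = [panel(rows, 0, 1)[i] for i in selected]
```

Because the selection is stored as observation indices rather than coordinates, it can be carried over to any other pair of variables.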


If we brush (i.e., highlight or color the observations with the brush) the X5 vs. X6 plot 
and move through the upper point cloud, we see that in other plots (e.g., X3 vs. X4), the 
corresponding observations are more embedded in the other sub-cloud. 


Summary 


Scatterplots in two and three dimensions help in identifying separated 
points, outliers or sub-clusters. 


Scatterplots help us in judging positive or negative dependencies. 


Draftman scatterplot matrices help detect structures conditioned on values 
of other variables. 


As the brush of a scatterplot matrix moves through a point cloud, we can 
study conditional dependence. 


1.5 Chernoff-Flury Faces 


If we are given data in numerical form, we tend to display it also numerically. This was 
done in the preceding sections: an observation x1 = (1,2) was plotted as the point (1,2) in a 
two-dimensional coordinate system. In multivariate analysis we want to understand data in 
low dimensions (e.g., on a 2D computer screen) although the structures are hidden in high 
dimensions. The numerical display of data structures using coordinates therefore ends at 
dimensions greater than three. 


If we are interested in condensing a structure into 2D elements, we have to consider 
alternative graphical techniques. The Chernoff-Flury faces, for example, provide such a 
condensation of high-dimensional information into a simple “face”. In fact faces are a simple way 
to graphically display high-dimensional data. The sizes of the face elements like pupils, eyes, 
upper and lower hair line, etc., are assigned to certain variables. The idea of using faces goes 
back to Chernoff (1973) and has been further developed by Bernhard Flury. We follow the 
design described in Flury and Riedwyl (1988) which uses the following characteristics. 


 1  right eye size 
 2  right pupil size 
 3  position of right pupil 
 4  right eye slant 
 5  horizontal position of right eye 
 6  vertical position of right eye 
 7  curvature of right eyebrow 
 8  density of right eyebrow 
 9  horizontal position of right eyebrow 
10  vertical position of right eyebrow 
11  right upper hair line 
12  right lower hair line 
13  right face line 
14  darkness of right hair 
15  right hair slant 
16  right nose line 
17  right size of mouth 
18  right curvature of mouth 
19-36  like 1-18, only for the left side. 


Figure 1.15. Chernoff-Flury faces for observations 91 to 110 of the bank 
notes. MVAfacebank10.xpl 


Figure 1.16. Chernoff-Flury faces for observations 1 to 50 of the bank 
notes. MVAfacebank50.xpl 


First, every variable that is to be coded into a characteristic face element is transformed 
into a (0,1) scale, i.e., the minimum of the variable corresponds to 0 and the maximum to 
1. The extreme positions of the face elements therefore correspond to a certain “grin” or 
“happy” face element. Dark hair might be coded as 1, and blond hair as 0 and so on. 
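
The transformation to the (0,1) scale described above is a min-max rescaling of each variable. A minimal sketch follows; the function name and the handling of a constant variable are assumptions made here, not part of the design in Flury and Riedwyl (1988).

```python
def rescale_01(values):
    """Map the minimum of a variable to 0 and its maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant variable carries no information, map it to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical values of one variable, e.g. a few diagonal lengths
diagonal = [139.5, 140.2, 141.0, 141.7, 142.2]
face_element = rescale_01(diagonal)
```

Each rescaled value can then drive one face element, for example 0 for the lightest and 1 for the darkest hair.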


As an example, consider the observations 91 to 110 of the bank data. Recall that the bank 
data set consists of 200 observations of dimension 6 where, for example, X6 is the diagonal 
of the note. If we assign the six variables to the following face elements 


X1 = 1, 19 (eye sizes) 
X2 = 2, 20 (pupil sizes) 
X3 = 4, 22 (eye slants) 
X4 = 11, 29 (upper hair lines) 
X5 = 12, 30 (lower hair lines) 
X6 = 13, 14, 31, 32 (face lines and darkness of hair), 


we obtain Figure 1.15. Also recall that observations 1-100 correspond to the genuine notes, 
and that observations 101-200 correspond to the counterfeit notes. The counterfeit bank 
notes then correspond to the lower half of Figure 1.15. In fact the faces for these observations 
look more grim and less happy. The variable X6 (diagonal) already worked well in the boxplot 
on Figure 1.4 in distinguishing between the counterfeit and genuine notes. Here, this variable 
is assigned to the face line and the darkness of the hair. That is why we clearly see a good 
separation within these 20 observations. 


What happens if we include all 100 genuine and all 100 counterfeit bank notes in the 
Chernoff-Flury face technique? Figures 1.16 and 1.17 show the faces of the genuine bank notes with the 
same assignments as used before and Figures 1.18 and 1.19 show the faces of the counterfeit 
bank notes. Comparing Figure 1.16 and Figure 1.18 one clearly sees that the diagonal (face 
line) is longer for genuine bank notes. Equivalently coded is the hair darkness (diagonal) 
which is lighter (shorter) for the counterfeit bank notes. One sees that the faces of the 
genuine bank notes have a much darker appearance and have broader face lines. The faces 
in Figures 1.16-1.17 are obviously different from the ones in Figures 1.18-1.19. 


Figure 1.17. Chernoff-Flury faces for observations 51 to 100 of the bank 
notes. MVAfacebank50.xpl 


Summary 


Faces can be used to detect subgroups in multivariate data. 


Subgroups are characterized by similar looking faces. 


Outliers are identified by extreme faces, e.g., dark hair, smile or a happy 
face. 


If one element of X is unusual, the corresponding face element significantly 
changes in shape. 


Figure 1.18. Chernoff-Flury faces for observations 101 to 150 of the bank 
notes. MVAfacebank50.xpl 


Figure 1.19. Chernoff-Flury faces for observations 151 to 200 of the bank 
notes. MVAfacebank50.xpl 


1.6 Andrews’ Curves 


The basic problem of graphical displays of multivariate data is the dimensionality. 
Scatterplots work well up to three dimensions (if we use interactive displays). More than three 
dimensions have to be coded into displayable 2D or 3D structures (e.g., faces). The idea 
of coding and representing multivariate data by curves was suggested by Andrews (1972). 
Each multivariate observation $X_i = (X_{i,1}, \ldots, X_{i,p})$ is transformed into a curve as follows: 


$$
f_i(t) = \begin{cases}
\dfrac{X_{i,1}}{\sqrt{2}} + X_{i,2}\sin(t) + X_{i,3}\cos(t) + \ldots + X_{i,p-1}\sin\left(\frac{p-1}{2}t\right) + X_{i,p}\cos\left(\frac{p-1}{2}t\right) & \text{for } p \text{ odd} \\[2ex]
\dfrac{X_{i,1}}{\sqrt{2}} + X_{i,2}\sin(t) + X_{i,3}\cos(t) + \ldots + X_{i,p}\sin\left(\frac{p}{2}t\right) & \text{for } p \text{ even}
\end{cases}
\qquad (1.13)
$$


such that the observation represents the coefficients of a so-called Fourier series ($t \in [-\pi, \pi]$). 


Suppose that we have three-dimensional observations: X1 = (0,0,1), X2 = (1,0,0) and 
X3 = (0,1,0). Here p = 3 and the following representations correspond to the Andrews’ 
curves: 


$$f_1(t) = \cos(t), \qquad f_2(t) = \frac{1}{\sqrt{2}} \qquad \text{and} \qquad f_3(t) = \sin(t).$$


These curves are indeed quite distinct, since the observations X1, X2, and X3 are the 3D 
unit vectors: each observation has mass only in one of the three dimensions. The order of 
the variables plays an important role. 


EXAMPLE 1.2 Let us take the 96th observation of the Swiss bank note data set, 

$$X_{96} = (215.6, 129.9, 129.9, 9.0, 9.5, 141.7).$$


The Andrews’ curve is by (1.13): 


$$f_{96}(t) = \frac{215.6}{\sqrt{2}} + 129.9\sin(t) + 129.9\cos(t) + 9.0\sin(2t) + 9.5\cos(2t) + 141.7\sin(3t).$$
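
Formula (1.13) and the worked example can be checked numerically with a short Python sketch. The function name and the evaluation grid are assumptions made for this illustration; the observation is the one quoted in Example 1.2.

```python
import math

def andrews_curve(x, t):
    """Evaluate the Andrews' curve (1.13) of the observation x at the point t."""
    value = x[0] / math.sqrt(2)
    for j, coef in enumerate(x[1:], start=1):
        freq = (j + 1) // 2                      # frequencies 1, 1, 2, 2, 3, ...
        if j % 2 == 1:
            value += coef * math.sin(freq * t)   # X_2, X_4, X_6, ... enter with sine
        else:
            value += coef * math.cos(freq * t)   # X_3, X_5, ... enter with cosine
    return value

# 96th observation of the Swiss bank note data (Example 1.2)
x96 = (215.6, 129.9, 129.9, 9.0, 9.5, 141.7)

# Evaluate the curve on a grid over [-pi, pi] (grid size chosen arbitrarily)
ts = [-math.pi + 2 * math.pi * k / 200 for k in range(201)]
curve96 = [andrews_curve(x96, t) for t in ts]

# Changing the order of the variables changes the curve
curve96_reversed = [andrews_curve(tuple(reversed(x96)), t) for t in ts]
```

Passing the variables in a different order, as in the last line, is all that is needed to study how the ordering affects the appearance of the curves.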


Figure 1.20 shows the Andrews’ curves for observations 96-105 of the Swiss bank note data 
set. We already know that the observations 96-100 represent genuine bank notes, and that 
the observations 101-105 represent counterfeit bank notes. We see that at least four curves 
differ from the others, but it is hard to tell which curve belongs to which group. 


We know from Figure 1.4 that the sixth variable is an important one. Therefore, the 
Andrews’ curves are calculated again using a reversed order of the variables. 


Figure 1.20. Andrews’ curves of the observations 96-105 from the 
Swiss bank note data. The order of the variables is 1,2,3,4,5,6. 
MVAandcur.xpl 


EXAMPLE 1.3 Let us consider again the 96th observation of the Swiss bank note data set, 

$$X_{96} = (215.6, 129.9, 129.9, 9.0, 9.5, 141.7).$$


The Andrews’ curve is computed using the reversed order of variables: 


$$f_{96}(t) = \frac{141.7}{\sqrt{2}} + 9.5\sin(t) + 9.0\cos(t) + 129.9\sin(2t) + 129.9\cos(2t) + 215.6\sin(3t).$$

In Figure 1.21 the curves $f_{96}$ to $f_{105}$ for observations 96-105 are plotted. Instead of a difference 
in high frequency, now we have a difference in the intercept, which makes it more difficult 
for us to see the differences in observations. 


Figure 1.21. Andrews’ curves of the observations 96-105 from the 
Swiss bank note data. The order of the variables is 6,5,4,3,2,1. 
MVAandcur2.xpl 


This shows that the order of the variables plays an important role for the interpretation. If 
X is high-dimensional, then the last variables will have only a small visible contribution to 
the curve. They fall into the high frequency part of the curve. To overcome this problem 
Andrews proposed using an order suggested by Principal Component Analysis. 
This technique will be treated in detail in Chapter 9. In fact, the sixth variable will appear 
there as the most important variable for discriminating between the two groups. If the 
number of observations is more than 20, there may be too many curves in one graph. This 
will result in overplotting of curves or a bad “signal-to-ink-ratio”, see Tufte (1983). It 
is therefore advisable to present multivariate observations via Andrews’ curves only for a 
limited number of observations. 


Summary 


Outliers appear as single Andrews’ curves that look different from the rest. 


A subgroup of data is characterized by a set of similar curves. 


The order of the variables plays an important role for interpretation. 


The order of variables may be optimized by Principal Component 
Analysis. 


For more than 20 observations we may obtain a bad “signal-to-ink-ratio”, 
i.e., too many curves are overlaid in one picture. 


1.7 Parallel Coordinates Plots 


Parallel coordinates plots (PCP) constitute a technique that is based on a non-Cartesian 
coordinate system and therefore allows one to “see” more than four dimensions. The idea 
is simple: Instead of plotting observations in an orthogonal coordinate system, one draws 
their coordinates in a system of parallel axes. Index $j$ of the coordinate is mapped onto the 
horizontal axis, and the value $x_j$ is mapped onto the vertical axis. This way of representation 
is very useful for high-dimensional data. It is however also sensitive to the order of the 
variables, since certain trends in the data can be shown more clearly in one ordering than in 
another. 


Figure 1.22. Parallel coordinates plot of observations 96-105. 
MVAparcoo1.xpl 


Figure 1.23. The entire bank data set. Genuine bank notes are displayed 
as black lines. The counterfeit bank notes are shown as red lines. 
MVAparcoo2.xpl 
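
The mapping behind a parallel coordinates plot is simple enough to write down directly. In the sketch below each observation becomes a polygon with vertices (j, x_j). The function name and the optional `order` argument are assumptions made for this illustration, and the observation is the 96th bank note quoted in Example 1.2.

```python
def pcp_polyline(x, order=None):
    """Vertices (position, value) of the polygon representing one observation.

    `order` lists the (0-based) variable indices in the order in which the
    parallel axes are drawn; by default the original order is used.
    """
    if order is None:
        order = range(len(x))
    return [(pos, x[j]) for pos, j in enumerate(order, start=1)]

# 96th observation of the Swiss bank note data (Example 1.2)
x96 = (215.6, 129.9, 129.9, 9.0, 9.5, 141.7)

polygon = pcp_polyline(x96)
polygon_reversed = pcp_polyline(x96, order=[5, 4, 3, 2, 1, 0])
```

In practice the variables are usually rescaled to a common range before plotting, since otherwise a variable with large values dominates the vertical axis.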


EXAMPLE 1.4 Take once again the observations 96-105 of the Swiss bank notes. These 
observations are six-dimensional, so we cannot show them in a six-dimensional Cartesian 
coordinate system. Using the parallel coordinates plot technique, however, they can be plotted 
on parallel axes. This is shown in Figure 1.22. 


We have already noted in Example 1.2 that the diagonal X6 plays an important role. This 
important role is clearly visible from Figure 1.22. The last coordinate X6 shows two different 
subgroups. The full bank note data set is displayed in Figure 1.23. One sees an overlap of 
the coordinate values for indices 1-3 and an increased separability for the indices 4-6. 


Summary 


Parallel coordinates plots overcome the visualization problem of the 
Cartesian coordinate system for dimensions greater than 4. 


Outliers are visible as outlying polygon curves. 


The order of variables is still important, for example, for detection of 
subgroups. 


Subgroups may be screened by selective coloring in an interactive manner. 


1.8 Boston Housing 


Aim of the analysis 


The Boston Housing data set was analyzed by Harrison and Rubinfeld (1978) who wanted 
to find out whether “clean air” had an influence on house prices. We will use this data set in 
this chapter and in most of the following chapters to illustrate the presented methodology. 
The data are described in Appendix B.1. 


What can be seen from the PCPs 


In order to highlight the relations of X14 to the remaining 13 variables we color all of the 
observations with X14 > median(X14) as red lines in Figure 1.24. Some of the variables seem 
to be strongly related. The most obvious relation is the negative dependence between X13 
and X14. It can also be argued that there exists a strong dependence between X12 and X14 
since no red lines are drawn in the lower part of X12. The opposite can be said about X11: 
there are only red lines plotted in the lower part of this variable. Low values of X11 induce 
high values of X14. 
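
The coloring rule used for Figure 1.24 can be sketched as follows. The prices are hypothetical values standing in for X14, and the choice of black for the remaining observations is an assumption.

```python
import statistics

def color_by_median(target):
    """Color an observation red if its value exceeds the median, else black."""
    med = statistics.median(target)
    return ["red" if v > med else "black" for v in target]

# Hypothetical median house values (stand-in for X14)
x14 = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9]
colors = color_by_median(x14)
```

Note that with the strict inequality an observation equal to the median is not colored red.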


For the PCP, the variables have been rescaled over the interval [0,1] for better graphical 
representations. The PCP shows that the variables are not distributed in a symmetric 
manner. It can be clearly seen that the values of X1 and X9 are much more concentrated 
around 0. Therefore it makes sense to consider transformations of the original data. 


Figure 1.24. Parallel coordinates plot for Boston Housing data. 
MVApcphousing.xpl 


The scatterplot matrix 


One characteristic of the PCPs is that many lines are drawn on top of each other. This 
problem is reduced by depicting the variables in pairs of scatterplots. Including all 14 
variables in one large scatterplot matrix is possible, but makes it hard to see anything from 
the plots. Therefore, for illustrative purposes we will analyze only one such matrix from a 
subset of the variables in Figure 1.25. On the basis of the PCP and the scatterplot matrix 
we would like to interpret each of the thirteen variables and their possible relation to the 
14th variable. Included in the figure are images for X1 to X5 and X14, although each variable 
is discussed in detail below. All references made to scatterplots in the following refer to 
Figure 1.25. 


Figure 1.25. Scatterplot matrix for variables X1, ..., X5 and X14 of the 
Boston Housing data. MVAdrafthousing.xpl 


Per-capita crime rate X1 


Taking the logarithm makes the variable’s distribution more symmetric. This can be seen 
in the boxplot of $\tilde{X}_1$ in Figure 1.27 which shows that the median and the mean have moved 
closer to each other than they were for the original X1. Plotting the kernel density 
estimate (KDE) of $\tilde{X}_1 = \log(X_1)$ would reveal that two subgroups might exist with different 
mean values. However, taking a look at the scatterplots in Figure 1.26 of the logarithms 
which include X1 does not clearly reveal such groups. Given that the scatterplot of log(X1) 
vs. log(X14) shows a relatively strong negative relation, it might be the case that the two 
subgroups of X1 correspond to houses with two different price levels. This is confirmed by 
the two boxplots shown to the right of the X1 vs. X2 scatterplot (in Figure 1.25): the red 
boxplot’s shape differs a lot from the black one’s, having a much higher median and mean. 


Figure 1.26. Scatterplot matrix for variables $\tilde{X}_1, \ldots, \tilde{X}_5$ and $\tilde{X}_{14}$ of the 
Boston Housing data. MVAdrafthousingt.xpl 


Proportion of residential area zoned for large lots X2 


It strikes the eye in Figure 1.25 that there is a large cluster of observations for which X2 is 
equal to 0. It also strikes the eye that, as the scatterplot of X1 vs. X2 shows, there is a 
strong, though non-linear, negative relation between X1 and X2: almost all observations for 
which X2 is high have an X1-value close to zero, and vice versa, many observations for which 
X2 is zero have quite a high per-capita crime rate X1. This could be due to the location of 
the areas, e.g., downtown districts might have a higher crime rate and at the same time it 
is unlikely that any residential land would be zoned in a generous manner. 


As far as the house prices are concerned it can be said that there seems to be no clear (linear) 
relation between X2 and X14, but it is obvious that the more expensive houses are situated 
in areas where X2 is large (this can be seen from the two boxplots on the second position of 
the diagonal, where the red one has a clearly higher mean/median than the black one). 


Proportion of non-retail business acres X3 


The PCP (in Figure 1.24) as well as the scatterplot of X3 vs. X14 shows an obvious negative 
relation between X3 and X14. The relationship between the logarithms of both variables 
seems to be almost linear. This negative relation might be explained by the fact that 
non-retail business sometimes causes annoying sounds and other pollution. Therefore, it seems 
reasonable to use X3 as an explanatory variable for the prediction of X14 in a linear-regression 
analysis. 


As far as the distribution of X3 is concerned it can be said that the kernel density estimate 
of X3 clearly has two peaks, which indicates that there are two subgroups. According to the 
negative relation between X3 and X14 it could be the case that one subgroup corresponds to 
the more expensive houses and the other one to the cheaper houses. 


Charles River dummy variable X4 


The observation made from the PCP that there are more expensive houses than cheap 
houses situated on the banks of the Charles River is confirmed by inspecting the scatterplot 
matrix. Still, we might have some doubt that the proximity to the river influences the house 
prices. Looking at the original data set, it becomes clear that the observations for which 
X4 equals one are districts that are close to each other. Apparently, the Charles River does 
not flow through too many different districts. Thus, it may be pure coincidence that the 
more expensive districts are close to the Charles River—their high values might be caused by 
many other factors such as the pupil/teacher ratio or the proportion of non-retail business 
acres. 


Nitric oxides concentration X5 


The scatterplot of X5 vs. X14 and the separate boxplots of X5 for more and less expensive 
houses reveal a clear negative relation between the two variables. As it was the main aim of 
the authors of the original study to determine whether pollution had an influence on housing 
prices, it should be considered very carefully whether X5 can serve as an explanatory variable 
for the price X14. A possible reason in favor of its being an explanatory variable is that people 
might not like to live in areas where the emissions of nitric oxides are high. Nitric oxides are 
emitted mainly by automobiles, by factories and from heating private homes. However, as 
one can imagine there are many good reasons besides nitric oxides not to live downtown or in 
industrial areas! Noise pollution, for example, might be a much better explanatory variable 
for the price of housing units. As the emission of nitric oxides is usually accompanied by 
noise pollution, using X5 as an explanatory variable for X14 might lead to the false conclusion 
that people run away from nitric oxides, whereas in reality it is noise pollution that they are 
trying to escape. 


Average number of rooms per dwelling X6 


The number of rooms per dwelling is a possible measure for the size of the houses. Thus we 
expect X6 to be strongly correlated with X14 (the houses’ median price). Indeed, apart from 
some outliers, the scatterplot of X6 vs. X14 shows a point cloud which is clearly 
upward-sloping and which seems to be a realisation of a linear dependence of X14 on X6. The two 
boxplots of X6 confirm this notion by showing that the quartiles, the mean and the median 
are all much higher for the red than for the black boxplot. 


Proportion of owner-occupied units built prior to 1940 X7 


There is no clear connection visible between X7 and X14. There could be a weak negative 
correlation between the two variables, since the (red) boxplot of X7 for the districts whose 
price is above the median price indicates a lower mean and median than the (black) boxplot 
for the districts whose price is below the median price. The fact that the correlation is not 
so clear could be explained by two opposing effects. On the one hand house prices should 
decrease if the older houses are not in a good shape. On the other hand prices could increase, 
because people often like older houses better than newer houses, preferring their atmosphere 
of space and tradition. Nevertheless, it seems reasonable that the houses’ age has an influence 
on their price X14. 


Raising X7 to the power of 2.5 reveals again that the data set might consist of two subgroups. 
But in this case it is not obvious that the subgroups correspond to more expensive or cheaper 
houses. One can furthermore observe a negative relation between X7 and X8. This could 
reflect the way the Boston metropolitan area developed over time: the districts with the 
newer buildings are farther away from employment centres with industrial facilities. 


Weighted distance to five Boston employment centres X8 


Since most people like to live close to their place of work, we expect a negative relation 
between the distances to the employment centres and the houses’ price. The scatterplot 
hardly reveals any dependence, but the boxplots of X8 indicate that there might be a slightly 
positive relation as the red boxplot’s median and mean are higher than the black one’s. 
Again, there might be two effects in opposite directions at work. The first is that living 
too close to an employment centre might not provide enough shelter from the pollution 
created there. The second, as mentioned above, is that people do not travel very far to their 
workplace. 


Index of accessibility to radial highways X9 


The first obvious thing one can observe in the scatterplots, as well as in the histograms and the 
kernel density estimates, is that there are two subgroups of districts containing X9 values 
which are close to the respective group’s mean. The scatterplots deliver no hint as to what 
might explain the occurrence of these two subgroups. The boxplots indicate that for the 
cheaper and for the more expensive houses the average of X9 is almost the same. 


Full-value property tax X10 


X10 shows a behavior similar to that of X9: two subgroups exist. A downward-sloping curve 
seems to underlie the relation of X10 and X14. This is confirmed by the two boxplots drawn 
for X10: the red one has a lower mean and median than the black one. 


Pupil/teacher ratio X11 


The red and black boxplots of X11 indicate a negative relation between X11 and X14. This 
is confirmed by inspection of the scatterplot of X11 vs. X14: the point cloud is 
downward-sloping, i.e., the fewer teachers there are per pupil, the less people pay on median for their 
dwellings. 


Proportion of blacks B, $X_{12} = 1000(B - 0.63)^2 I(B < 0.63)$ 


Interestingly, X12 is negatively (though not linearly) correlated with X3, X7 and X11, 
whereas it is positively related with X14. Having a look at the data set reveals that for 
almost all districts X12 takes on a value around 390. Since B cannot be larger than 0.63, 
such values can only be caused by B close to zero. Therefore, the higher X12 is, the lower 
the actual proportion of blacks is! Among observations 405 through 470 there are quite a 
few that have an X12 that is much lower than 390. This means that in these districts the 
proportion of blacks is above zero. We can observe two clusters of points in the scatterplots 
of log(X12): one cluster for which X12 is close to 390 and a second one for which X12 is 
between 3 and 100. When X12 is positively related with another variable, the actual 
proportion of blacks is negatively correlated with this variable and vice versa. This means that 
blacks live in areas where there is a high proportion of non-retail business acres, where there 
are older houses and where there is a high (i.e., bad) pupil/teacher ratio. It can be observed 
that districts with housing prices above the median can only be found where the proportion 
of blacks is virtually zero! 




Proportion of lower status of the population X13 


Of all the variables X13 exhibits the clearest negative relation with X14: hardly any outliers 
show up. Taking the square root of X13 and the logarithm of X14 transforms the relation 
into a linear one. 


Transformations 


Since most of the variables exhibit an asymmetry with a higher density on the left side, the 
following transformations are proposed: 


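$\tilde{X}_1 = \log(X_1)$ 
$\tilde{X}_2 = X_2/10$ 
$\tilde{X}_3 = \log(X_3)$ 
$\tilde{X}_4$: none, since $X_4$ is binary 
$\tilde{X}_5 = \log(X_5)$ 
$\tilde{X}_6 = \log(X_6)$ 
$\tilde{X}_7 = X_7^{2.5}/10000$ 
$\tilde{X}_8 = \log(X_8)$ 
$\tilde{X}_9 = \log(X_9)$ 
$\tilde{X}_{10} = \log(X_{10})$ 
$\tilde{X}_{11} = \exp(0.4 \times X_{11})/1000$ 
$\tilde{X}_{12} = X_{12}/100$ 
$\tilde{X}_{13} = \sqrt{X_{13}}$ 
$\tilde{X}_{14} = \log(X_{14})$ 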


Taking the logarithm or raising the variables to the power of something smaller than one helps 
to reduce the asymmetry. This is due to the fact that lower values move further away from 
each other, whereas the distance between greater values is reduced by these transformations. 
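
The effect of such transformations on asymmetry can be illustrated with a small sketch that compares the skewness before and after taking logarithms. The data values are hypothetical right-skewed numbers, not the Boston Housing data, and the moment-based skewness coefficient is one of several possible measures of asymmetry.

```python
import math

def skewness(values):
    """Moment-based skewness coefficient m3 / m2^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed values (many small ones, a few very large ones)
x = [0.01, 0.03, 0.05, 0.1, 0.3, 0.8, 2.0, 9.0, 25.0, 70.0]

log_x = [math.log(v) for v in x]     # logarithm
sqrt_x = [math.sqrt(v) for v in x]   # power smaller than one

skew_original = skewness(x)
skew_log = skewness(log_x)
skew_sqrt = skewness(sqrt_x)
```

For these values both transformations bring the skewness closer to zero, with the logarithm acting more strongly than the square root.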


Figure 1.27 displays boxplots for the original mean-variance scaled variables as well as for the 
proposed transformed variables. The transformed variables’ boxplots are more symmetric 
and have fewer outliers than the original variables’ boxplots. 


Figure 1.27. Boxplots for all of the variables from the Boston Housing 
data before and after the proposed transformations. MVAboxbhd.xpl 


1.9 Exercises 


EXERCISE 1.1 Is the upper extreme always an outlier? 


EXERCISE 1.2 Is it possible for the mean or the median to lie outside of the fourths or 
even outside of the outside bars? 


EXERCISE 1.3 Assume that the data are normally distributed N(0,1). What percentage of 
the data do you expect to lie outside the outside bars? 


EXERCISE 1.4 What percentage of the data do you expect to lie outside the outside bars if 
we assume that the data are normally distributed $N(0, \sigma^2)$ with unknown variance $\sigma^2$? 


EXERCISE 1.5 How would the five-number summary of the 15 largest U.S. cities differ from 
that of the 50 largest U.S. cities? How would the five-number summary of 15 observations 
of N(0,1)-distributed data differ from that of 50 observations from the same distribution? 


EXERCISE 1.6 Is it possible that all five numbers of the five-number summary could be 
equal? If so, under what conditions? 


EXERCISE 1.7 Suppose we have 50 observations of X ~ N(0,1) and another 50 observations 
of Y ~ N(2,1). What would the 100 Flury faces look like if you had defined as face 
elements the face line and the darkness of hair? Do you expect any similar faces? How many 
faces do you think should look like observations of Y even though they are X observations? 


EXERCISE 1.8 Draw a histogram for the mileage variable of the car data (Table B.3). Do 
the same for the three groups (U.S., Japan, Europe). Do you obtain a similar conclusion as 
in the parallel boxplot on Figure 1.3 for these data? 


EXERCISE 1.9 Use some bandwidth selection criterion to calculate the optimally chosen 
bandwidth h for the diagonal variable of the bank notes. Would it be better to have one 
bandwidth for the two groups? 


EXERCISE 1.10 In Figure 1.9 the densities overlap in the region of diagonal = 140.4. We 
partially observed this in the boxplot of Figure 1.4. Our aim is to separate the two groups. 
Will we be able to do this effectively on the basis of this diagonal variable alone? 


EXERCISE 1.11 Draw a parallel coordinates plot for the car data. 


EXERCISE 1.12 How would you identify discrete variables (variables with only a limited 
number of possible outcomes) on a parallel coordinates plot? 


EXERCISE 1.13 True or false: the height of the bars of a histogram are equal to the relative 
frequency with which observations fall into the respective bins. 


EXERCISE 1.14 True or false: kernel density estimates must always take on a value between 
0 and 1. (Hint: Which quantity connected with the density function has to be equal to 1? 
Does this property imply that the density function has to always be less than 1?) 


EXERCISE 1.15 Let the following data set represent the heights of 13 students taking the 
Applied Multivariate Statistical Analysis course: 


1.72, 1.83, 1.74, 1.79, 1.94, 1.81, 1.66, 1.60, 1.78, 1.77, 1.85, 1.70, 1.76. 


1. Find the corresponding five-number summary. 
2. Construct the boxplot. 
3. Draw a histogram for this data set. 


EXERCISE 1.16 Describe the unemployment data (see Table B.19) that contain unemploy- 
ment rates of all German Federal States using various descriptive techniques. 


EXERCISE 1.17 Using yearly population data (see B.20), generate 


1. a boxplot (choose one of the variables) 
2. an Andrews' curve (choose ten data points) 
3. a scatterplot 
4. a histogram (choose one of the variables) 


What do these graphs tell you about the data and their structure? 


EXERCISE 1.18 Make a draftman plot for the car data with the variables 


$X_1$ = price, 
$X_2$ = mileage, 
$X_8$ = weight, 
$X_9$ = length. 


Move the brush into the region of heavy cars. What can you say about price, mileage and 
length? Move the brush onto high fuel economy. Mark the Japanese, European and U.S. 
American cars. You should find the same condition as in boxplot Figure 1.2. 


EXERCISE 1.19 What is the form of a scatterplot of two independent random variables $X_1$ 
and $X_2$ with standard Normal distribution? 


EXERCISE 1.20 Rotate a three-dimensional standard normal point cloud in 3D space. Does 
it “almost look the same from all sides”? Can you explain why or why not? 


Part II 


Multivariate Random Variables 


2 A Short Excursion into Matrix Algebra 


This chapter is a reminder of basic concepts of matrix algebra, which are particularly useful 
in multivariate analysis. It also introduces the notations used in this book for vectors and 
matrices. Eigenvalues and eigenvectors play an important role in multivariate techniques. 
In Sections 2.2 and 2.3, we present the spectral decomposition of matrices and consider the 
maximization (minimization) of quadratic forms given some constraints. 


In analyzing the multivariate normal distribution, partitioned matrices appear naturally. 
Some of the basic algebraic properties are given in Section 2.5. These properties will be 
heavily used in Chapters 4 and 5. 


The geometry of the multinormal and the geometric interpretation of the multivariate 
techniques (Part III) rely heavily on the notion of the angle between two vectors, the projection 
of a point on a vector and the distance between two points. These ideas are introduced in 
Section 2.6. 


2.1 Elementary Operations 


A matrix A is a system of numbers with n rows and p columns: 


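$$
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1p} \\
a_{21} & a_{22} & \cdots & a_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{np}
\end{pmatrix}
$$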


We also write $(a_{ij})$ for $A$ and $A(n \times p)$ to indicate the numbers of rows and columns. Vectors 
are matrices with one column and are denoted as $x$ or $x(p \times 1)$. Special matrices and vectors 
are defined in Table 2.1. Note that we use small letters for scalars as well as for vectors. 


Matrix Operations 


Elementary operations are summarized below: 


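$$
\begin{aligned}
A^{\top} &= (a_{ji}) \\
A + B &= (a_{ij} + b_{ij}) \\
A - B &= (a_{ij} - b_{ij}) \\
c \cdot A &= (c \cdot a_{ij}) \\
A \cdot B &= A(n \times p)\, B(p \times m) = C(n \times m) = \left( \sum_{j=1}^{p} a_{ij} b_{jk} \right)
\end{aligned}
$$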


Properties of Matrix Operations 


$$
\begin{aligned}
A + B &= B + A \\
A(B + C) &= AB + AC \\
A(BC) &= (AB)C \\
(A^{\top})^{\top} &= A \\
(AB)^{\top} &= B^{\top} A^{\top}
\end{aligned}
$$


Matrix Characteristics 


Rank 

The rank, $\operatorname{rank}(A)$, of a matrix $A(n \times p)$ is defined as the maximum number of linearly 
independent rows (columns). A set of $k$ rows $a_j$ of $A(n \times p)$ are said to be linearly independent 
if $\sum_{j=1}^{k} c_j a_j = 0_p$ implies $c_j = 0$, $\forall j$, where $c_1, \ldots, c_k$ are scalars. In other words no rows in 
this set can be expressed as a linear combination of the $(k - 1)$ remaining rows. 


Trace 


The trace of a matrix is the sum of its diagonal elements 

$$
\operatorname{tr}(A) = \sum_{i=1}^{p} a_{ii}.
$$


Name                      Definition                        Notation       Example
scalar                    p = n = 1                         a              3
column vector             p = 1                             a              (1, 3)^T
row vector                n = 1                             a^T            (1 3)
vector of ones            (1, ..., 1)^T  (n entries)        1_n            (1, 1)^T
vector of zeros           (0, ..., 0)^T  (n entries)        0_n            (0, 0)^T
square matrix             n = p                             A(p x p)       [2 0; 0 2]
diagonal matrix           a_ij = 0 for i ≠ j, n = p         diag(a_ii)     [1 0; 0 2]
identity matrix           diag(1, ..., 1)  (p entries)      I_p            [1 0; 0 1]
unit matrix               a_ij = 1, n = p                   1_n 1_n^T      [1 1; 1 1]
symmetric matrix          a_ij = a_ji                                      [1 2; 2 3]
null matrix               a_ij = 0                          0              [0 0; 0 0]
upper triangular matrix   a_ij = 0 for i > j                               [1 2 4; 0 1 3; 0 0 1]
idempotent matrix         A A = A                                          [1 0 0; 0 1/2 1/2; 0 1/2 1/2]
orthogonal matrix         A^T A = I = A A^T                                [1/√2 1/√2; 1/√2 -1/√2]

Matrices in the Example column are written row by row, with rows separated by semicolons. 


Table 2.1. Special matrices and vectors. 


Determinant 


The determinant is an important concept of matrix algebra. For a square matrix A, it is 
defined as: 
$$
\det(A) = |A| = \sum (-1)^{|\tau|}\, a_{1\tau(1)} \cdots a_{p\tau(p)},
$$

the summation is over all permutations $\tau$ of $\{1, 2, \ldots, p\}$, and $|\tau| = 0$ if the permutation can 
be written as a product of an even number of transpositions and $|\tau| = 1$ otherwise. 


EXAMPLE 2.1 In the case of $p = 2$, $A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$ and we can permute the digits “1” 
and “2” once or not at all. So, 

$$
|A| = a_{11} a_{22} - a_{12} a_{21}.
$$
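The permutation sum in the definition can be evaluated directly for small matrices. The following is a minimal sketch in plain Python; the function names `permutation_sign` and `det_by_permutations` and the test matrices are illustrative assumptions, not part of the text. The sign $(-1)^{|\tau|}$ is obtained from the parity of the number of inversions, which agrees with the parity of the number of transpositions.

```python
from itertools import permutations


def permutation_sign(tau):
    """Return (-1)^{|tau|}: +1 for an even permutation, -1 for an odd one."""
    inversions = sum(
        1
        for i in range(len(tau))
        for j in range(i + 1, len(tau))
        if tau[i] > tau[j]
    )
    return -1 if inversions % 2 else 1


def det_by_permutations(a):
    """Determinant of a square matrix (list of rows) from the permutation sum."""
    p = len(a)
    total = 0
    for tau in permutations(range(p)):
        term = permutation_sign(tau)
        for i in range(p):
            term *= a[i][tau[i]]      # factor a_{i, tau(i)}
        total += term
    return total


det2 = det_by_permutations([[1, 2], [2, 3]])                    # a11*a22 - a12*a21
det3 = det_by_permutations([[2, 0, 1], [1, 3, 2], [1, 1, 4]])
```

The sum has $p!$ terms, so this is only practical for very small $p$; it is meant to mirror the definition, not to be an efficient algorithm.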


Transpose 
For A(n x p) and B(p x n) 


$(A^{\top})^{\top} = A$, and $(AB)^{\top} = B^{\top} A^{\top}$. 


Inverse 
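If $|A| \neq 0$ and $A(p \times p)$, then the inverse $A^{-1}$ exists: 

$$
A A^{-1} = A^{-1} A = I_p.
$$

For small matrices, the inverse of $A = (a_{ij})$ can be calculated as 

$$
A^{-1} = \frac{C}{|A|},
$$

where $C = (c_{ij})$ is the adjoint matrix of $A$. The elements $c_{ji}$ of $C^{\top}$ are the co-factors of $A$, 
i.e., signed determinants of $A$ with the $i$-th row and the $j$-th column removed: 

$$
c_{ji} = (-1)^{i+j}
\begin{vmatrix}
a_{11} & \cdots & a_{1(j-1)} & a_{1(j+1)} & \cdots & a_{1p} \\
\vdots & & \vdots & \vdots & & \vdots \\
a_{(i-1)1} & \cdots & a_{(i-1)(j-1)} & a_{(i-1)(j+1)} & \cdots & a_{(i-1)p} \\
a_{(i+1)1} & \cdots & a_{(i+1)(j-1)} & a_{(i+1)(j+1)} & \cdots & a_{(i+1)p} \\
\vdots & & \vdots & \vdots & & \vdots \\
a_{p1} & \cdots & a_{p(j-1)} & a_{p(j+1)} & \cdots & a_{pp}
\end{vmatrix}.
$$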

G-inverse 


A more general concept is the G-inverse (Generalized Inverse) $A^{-}$ which satisfies the following: 

$$
A A^{-} A = A.
$$

Later we will see that there may be more than one G-inverse. 


EXAMPLE 2.2 The generalized inverse can also be calculated for singular matrices. We have: 

$$
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix},
$$

which means that the generalized inverse of $A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$ is $A^{-} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$ even though 
the inverse matrix of $A$ does not exist in this case. 
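The defining property $A A^{-} A = A$ is easy to test numerically. The sketch below assumes NumPy is available; the helper `is_g_inverse` and the candidate matrices `G2` and `G3` are illustrative choices that do not come from the text.

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 0.0]])       # singular: |A| = 0


def is_g_inverse(A, G):
    """True if G satisfies the defining property A G A = A."""
    return bool(np.allclose(A @ G @ A, A))


G1 = np.array([[1.0, 0.0], [0.0, 0.0]])      # the G-inverse of Example 2.2
G2 = np.linalg.pinv(A)                       # Moore-Penrose inverse, one particular G-inverse
G3 = np.array([[1.0, 5.0], [-2.0, 7.0]])     # arbitrary entries outside position (1,1)

checks = [is_g_inverse(A, G) for G in (G1, G2, G3)]
```

For this particular $A$, the product $A G A$ only picks out the upper-left element of $G$, so every matrix with a 1 in that position passes the check.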


Eigenvalues, Eigenvectors 


Consider a $(p \times p)$ matrix $A$. If there exists a scalar $\lambda$ and a nonzero vector $\gamma$ such that 

$$
A \gamma = \lambda \gamma, \tag{2.1}
$$

then we call $\lambda$ an eigenvalue and $\gamma$ an eigenvector. 


It can be proven that an eigenvalue $\lambda$ is a root of the $p$-th order polynomial $|A - \lambda I_p| = 0$. 
Therefore, there are up to $p$ eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ of $A$. For each eigenvalue $\lambda_j$, there 
exists a corresponding eigenvector $\gamma_j$ given by equation (2.1). Suppose the matrix $A$ has 
the eigenvalues $\lambda_1, \ldots, \lambda_p$. Let $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_p)$. 


The determinant |.A| and the trace tr(A) can be rewritten in terms of the eigenvalues: 


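$$
|A| = |\Lambda| = \prod_{j=1}^{p} \lambda_j \tag{2.2}
$$

$$
\operatorname{tr}(A) = \operatorname{tr}(\Lambda) = \sum_{j=1}^{p} \lambda_j \tag{2.3}
$$


An idempotent matrix $A$ (see the definition in Table 2.1) can only have eigenvalues in $\{0, 1\}$; 
therefore $\operatorname{tr}(A) = \operatorname{rank}(A)$ = number of eigenvalues $\neq 0$. 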


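EXAMPLE 2.3 Let us consider the matrix 

$$
A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} \end{pmatrix}.
$$

It is easy to verify that $AA = A$ which implies that the matrix $A$ is idempotent. 


We know that the eigenvalues of an idempotent matrix are equal to 0 or 1. In this case, the 
eigenvalues of $A$ are $\lambda_1 = 1$, $\lambda_2 = 1$, and $\lambda_3 = 0$ since 

$$
A \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = 1 \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad
A \begin{pmatrix} 0 \\ \tfrac{\sqrt{2}}{2} \\ \tfrac{\sqrt{2}}{2} \end{pmatrix} = 1 \begin{pmatrix} 0 \\ \tfrac{\sqrt{2}}{2} \\ \tfrac{\sqrt{2}}{2} \end{pmatrix}, \quad
A \begin{pmatrix} 0 \\ \tfrac{\sqrt{2}}{2} \\ -\tfrac{\sqrt{2}}{2} \end{pmatrix} = 0 \begin{pmatrix} 0 \\ \tfrac{\sqrt{2}}{2} \\ -\tfrac{\sqrt{2}}{2} \end{pmatrix}.
$$

Using formulas (2.2) and (2.3), we can calculate the trace and the determinant of $A$ from 
the eigenvalues: $\operatorname{tr}(A) = \lambda_1 + \lambda_2 + \lambda_3 = 2$, $|A| = \lambda_1 \lambda_2 \lambda_3 = 0$, and $\operatorname{rank}(A) = 2$. 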
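The statements of this example can be reproduced numerically. The following sketch assumes NumPy is available; the variable names are illustrative.

```python
import numpy as np

A = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.5, 0.5]])

is_idempotent = bool(np.allclose(A @ A, A))

# A is symmetric, so eigvalsh applies; it returns the eigenvalues in ascending order.
eigenvalues = np.linalg.eigvalsh(A)

trace_from_eigs = eigenvalues.sum()      # formula (2.3)
det_from_eigs = eigenvalues.prod()       # formula (2.2)
rank = np.linalg.matrix_rank(A)
```

Up to floating point rounding, the eigenvalues come out as 0, 1, 1, so that the trace, the rank and the number of non-zero eigenvalues all equal 2.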


Properties of Matrix Characteristics 


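$A(n \times n)$, $B(n \times n)$, $c \in \mathbb{R}$ 

$$
\begin{aligned}
\operatorname{tr}(A + B) &= \operatorname{tr} A + \operatorname{tr} B &\quad (2.4) \\
\operatorname{tr}(cA) &= c \operatorname{tr} A &\quad (2.5) \\
|cA| &= c^{n} |A| &\quad (2.6) \\
|AB| &= |BA| = |A|\,|B| &\quad (2.7)
\end{aligned}
$$

$A(n \times p)$, $B(p \times n)$ 

$$
\begin{aligned}
\operatorname{tr}(A \cdot B) &= \operatorname{tr}(B \cdot A) &\quad (2.8) \\
\operatorname{rank}(A) &\le \min(n, p) & \\
\operatorname{rank}(A) &\ge 0 &\quad (2.9) \\
\operatorname{rank}(A) &= \operatorname{rank}(A^{\top}) &\quad (2.10) \\
\operatorname{rank}(A^{\top} A) &= \operatorname{rank}(A) &\quad (2.11) \\
\operatorname{rank}(A + B) &\le \operatorname{rank}(A) + \operatorname{rank}(B) &\quad (2.12) \\
\operatorname{rank}(AB) &\le \min\{\operatorname{rank}(A), \operatorname{rank}(B)\} &\quad (2.13)
\end{aligned}
$$

$A(n \times p)$, $B(p \times q)$, $C(q \times n)$ 

$$
\begin{aligned}
\operatorname{tr}(ABC) &= \operatorname{tr}(BCA) = \operatorname{tr}(CAB) &\quad (2.14) \\
\operatorname{rank}(ABC) &= \operatorname{rank}(B) \quad \text{for nonsingular } A, C &\quad (2.15)
\end{aligned}
$$

$A(p \times p)$ 

$$
\begin{aligned}
|A^{-1}| &= |A|^{-1} &\quad (2.16) \\
\operatorname{rank}(A) &= p \quad \text{if and only if } A \text{ is nonsingular.} &\quad (2.17)
\end{aligned}
$$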


Summary 


↪ The determinant $|A|$ is the product of the eigenvalues of $A$. 

↪ The inverse of a matrix $A$ exists if $|A| \neq 0$. 

↪ The trace $\operatorname{tr}(A)$ is the sum of the eigenvalues of $A$. 

↪ The sum of the traces of two matrices equals the trace of the sum of the 
two matrices. 

↪ The trace $\operatorname{tr}(AB)$ equals $\operatorname{tr}(BA)$. 

↪ The $\operatorname{rank}(A)$ is the maximal number of linearly independent rows 
(columns) of $A$. 


2.2 Spectral Decompositions 


The computation of eigenvalues and eigenvectors is an important issue in the analysis of 
matrices. The spectral decomposition or Jordan decomposition links the structure of a 
matrix to the eigenvalues and the eigenvectors. 


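THEOREM 2.1 (Jordan Decomposition) Each symmetric matrix $A(p \times p)$ can be written 
as 

$$
A = \Gamma \Lambda \Gamma^{\top} = \sum_{j=1}^{p} \lambda_j \gamma_j \gamma_j^{\top} \tag{2.18}
$$

where 

$$
\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_p)
$$

and where 

$$
\Gamma = (\gamma_1, \gamma_2, \ldots, \gamma_p)
$$

is an orthogonal matrix consisting of the eigenvectors $\gamma_j$ of $A$. 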


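EXAMPLE 2.4 Suppose that $A = \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}$. The eigenvalues are found by solving $|A - \lambda I| = 0$. 
This is equivalent to 

$$
\begin{vmatrix} 1 - \lambda & 2 \\ 2 & 3 - \lambda \end{vmatrix} = (1 - \lambda)(3 - \lambda) - 4 = 0.
$$

Hence, the eigenvalues are $\lambda_1 = 2 + \sqrt{5}$ and $\lambda_2 = 2 - \sqrt{5}$. The eigenvectors are $\gamma_1 = 
(0.5257, 0.8506)^{\top}$ and $\gamma_2 = (0.8506, -0.5257)^{\top}$. They are orthogonal since $\gamma_1^{\top} \gamma_2 = 0$. 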
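The decomposition of Theorem 2.1 can be checked numerically for the matrix of this example. The sketch assumes NumPy; `np.linalg.eigh` is NumPy's routine for symmetric matrices and returns the eigenvalues in ascending order, so the ordering differs from the one used in the text. Eigenvectors are only determined up to sign.

```python
import numpy as np

A = np.array([[1.0, 2.0], [2.0, 3.0]])

# Columns of Gamma are orthonormal eigenvectors matching the entries of lam.
lam, Gamma = np.linalg.eigh(A)
Lambda = np.diag(lam)

reconstructed = Gamma @ Lambda @ Gamma.T

# Equivalent sum of rank-one terms: sum_j lambda_j * gamma_j gamma_j^T
rank_one_sum = sum(lam[j] * np.outer(Gamma[:, j], Gamma[:, j]) for j in range(2))

is_orthogonal = bool(np.allclose(Gamma.T @ Gamma, np.eye(2)))
```

Both `reconstructed` and `rank_one_sum` should agree with `A` up to rounding error, which is the numerical counterpart of (2.18).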


Using spectral decomposition, we can define powers of a matrix A(p x p). Suppose A is a 
symmetric matrix. Then by Theorem 2.1 


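$$
A = \Gamma \Lambda \Gamma^{\top},
$$

and we define for some $\alpha \in \mathbb{R}$ 

$$
A^{\alpha} = \Gamma \Lambda^{\alpha} \Gamma^{\top}, \tag{2.19}
$$

where $\Lambda^{\alpha} = \operatorname{diag}(\lambda_1^{\alpha}, \ldots, \lambda_p^{\alpha})$. In particular, we can easily calculate the inverse of the matrix 
$A$. Suppose that the eigenvalues of $A$ are positive. Then with $\alpha = -1$, we obtain the inverse 
of $A$ from 

$$
A^{-1} = \Gamma \Lambda^{-1} \Gamma^{\top}. \tag{2.20}
$$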
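Formulas (2.19) and (2.20) translate into a few lines of code. The sketch assumes NumPy; the function name `symmetric_power` and the test matrix (eigenvalues 1 and 3) are illustrative assumptions.

```python
import numpy as np


def symmetric_power(A, alpha):
    """A^alpha = Gamma Lambda^alpha Gamma^T for a symmetric matrix A, as in (2.19)."""
    lam, Gamma = np.linalg.eigh(A)
    return Gamma @ np.diag(lam ** alpha) @ Gamma.T


A = np.array([[2.0, 1.0], [1.0, 2.0]])   # symmetric with positive eigenvalues 1 and 3

A_inv = symmetric_power(A, -1)           # formula (2.20)
A_sqrt = symmetric_power(A, 0.5)         # a symmetric square root of A
```

Non-integer or negative powers are only safe here because all eigenvalues are positive; with a zero or negative eigenvalue, `lam ** alpha` would produce infinities or NaNs.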


Another interesting decomposition which is later used is given in the following theorem. 


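THEOREM 2.2 (Singular Value Decomposition) Each matrix $A(n \times p)$ with rank $r$ can 
be decomposed as 

$$
A = \Gamma \Lambda \Delta^{\top},
$$

where $\Gamma(n \times r)$ and $\Delta(p \times r)$. Both $\Gamma$ and $\Delta$ are column orthonormal, i.e., $\Gamma^{\top} \Gamma = \Delta^{\top} \Delta = I_r$, 
and $\Lambda = \operatorname{diag}(\lambda_1^{1/2}, \ldots, \lambda_r^{1/2})$, $\lambda_j > 0$. The values $\lambda_1, \ldots, \lambda_r$ are the non-zero eigenvalues of 
the matrices $A A^{\top}$ and $A^{\top} A$. $\Gamma$ and $\Delta$ consist of the corresponding $r$ eigenvectors of these 
matrices. 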


This is obviously a generalization of Theorem 2.1 (Jordan decomposition). With Theorem 
2.2, we can find a G-inverse $A^{-}$ of $A$. Indeed, define $A^{-} = \Delta \Lambda^{-1} \Gamma^{\top}$. Then $A A^{-} A = 
\Gamma \Lambda \Delta^{\top} = A$. Note that the G-inverse is not unique. 
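The construction of a G-inverse from the singular value decomposition can be sketched as follows, assuming NumPy. The rank-one test matrix and the tolerance used to determine the numerical rank are illustrative assumptions.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])                 # 3 x 2 matrix of rank 1

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10))                 # numerical rank

Gamma = U[:, :r]                           # n x r, orthonormal columns
Delta = Vt[:r, :].T                        # p x r, orthonormal columns
Lam = np.diag(s[:r])                       # square roots of the non-zero eigenvalues of A^T A

A_minus = Delta @ np.linalg.inv(Lam) @ Gamma.T    # G-inverse as defined after Theorem 2.2
```

The matrix `A_minus` obtained this way coincides with the Moore-Penrose inverse that `np.linalg.pinv(A)` returns; other G-inverses exist, as the text notes.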


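EXAMPLE 2.5 In Example 2.2, we showed that the generalized inverse of $A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$ 
is $A^{-} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$. The following also holds 

$$
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & 8 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}
$$

which means that the matrix $\begin{pmatrix} 1 & 0 \\ 0 & 8 \end{pmatrix}$ is also a generalized inverse of $A$. 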


Summary 


↪ The Jordan decomposition gives a representation of a symmetric matrix 
in terms of eigenvalues and eigenvectors. 

↪ The eigenvectors belonging to the largest eigenvalues indicate the “main 
direction” of the data. 

↪ The Jordan decomposition allows one to easily compute the power of a 
symmetric matrix $A$: $A^{\alpha} = \Gamma \Lambda^{\alpha} \Gamma^{\top}$. 

↪ The singular value decomposition (SVD) is a generalization of the Jordan 
decomposition to non-quadratic matrices. 


2.3. Quadratic Forms 


A quadratic form $Q(x)$ is built from a symmetric matrix $A(p \times p)$ and a vector $x \in \mathbb{R}^p$: 

$$
Q(x) = x^{\top} A x = \sum_{i=1}^{p} \sum_{j=1}^{p} a_{ij} x_i x_j. \tag{2.21}
$$


Definiteness of Quadratic Forms and Matrices 


$Q(x) > 0$ for all $x \neq 0$: positive definite 
$Q(x) \ge 0$ for all $x \neq 0$: positive semidefinite 


A matrix $A$ is called positive definite (semidefinite) if the corresponding quadratic form $Q(\cdot)$ 
is positive definite (semidefinite). We write $A > 0$ ($\ge 0$). 


Quadratic forms can always be diagonalized, as the following result shows. 


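THEOREM 2.3 If $A$ is symmetric and $Q(x) = x^{\top} A x$ is the corresponding quadratic form, 
then there exists a transformation $x \mapsto \Gamma^{\top} x = y$ such that 

$$
x^{\top} A x = \sum_{i=1}^{p} \lambda_i y_i^2,
$$

where $\lambda_i$ are the eigenvalues of $A$. 

Proof: 
$A = \Gamma \Lambda \Gamma^{\top}$. By Theorem 2.1 and $y = \Gamma^{\top} x$ we have that $x^{\top} A x = x^{\top} \Gamma \Lambda \Gamma^{\top} x = y^{\top} \Lambda y = 
\sum_{i=1}^{p} \lambda_i y_i^2$. □ 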


Positive definiteness of quadratic forms can be deduced from positive eigenvalues. 


THEOREM 2.4 $A > 0$ if and only if all $\lambda_i > 0$, $i = 1, \ldots, p$. 


Proof: 
$0 < \lambda_1 y_1^2 + \cdots + \lambda_p y_p^2 = x^{\top} A x$ for all $x \neq 0$ by Theorem 2.3. □ 


COROLLARY 2.1 If $A > 0$, then $A^{-1}$ exists and $|A| > 0$. 


EXAMPLE 2.6 The quadratic form $Q(x) = x_1^2 + x_2^2$ corresponds to the matrix $A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ with 
eigenvalues $\lambda_1 = \lambda_2 = 1$ and is thus positive definite. The quadratic form $Q(x) = (x_1 - x_2)^2$ 
corresponds to the matrix $A = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$ with eigenvalues $\lambda_1 = 2, \lambda_2 = 0$ and is positive 
semidefinite. The quadratic form $Q(x) = x_1^2 - x_2^2$ with eigenvalues $\lambda_1 = 1, \lambda_2 = -1$ is 
indefinite. 
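Theorem 2.4 suggests a simple numerical test for definiteness based on the signs of the eigenvalues. The sketch below assumes NumPy; the function name `definiteness`, the tolerance `tol` and the extra negative (semi)definite labels are illustrative additions.

```python
import numpy as np


def definiteness(A, tol=1e-12):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    lam = np.linalg.eigvalsh(A)
    if np.all(lam > tol):
        return "positive definite"
    if np.all(lam > -tol):
        return "positive semidefinite"
    if np.all(lam < -tol):
        return "negative definite"
    if np.all(lam < tol):
        return "negative semidefinite"
    return "indefinite"


Q1 = np.array([[1.0, 0.0], [0.0, 1.0]])      # x1^2 + x2^2
Q2 = np.array([[1.0, -1.0], [-1.0, 1.0]])    # (x1 - x2)^2
Q3 = np.array([[1.0, 0.0], [0.0, -1.0]])     # x1^2 - x2^2

labels = [definiteness(Q) for Q in (Q1, Q2, Q3)]
```

The tolerance is needed because an eigenvalue that is exactly zero in theory may be computed as a tiny positive or negative number.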


In the statistical analysis of multivariate data, we are interested in maximizing quadratic 
forms given some constraints. 


THEOREM 2.5 If $A$ and $B$ are symmetric and $B > 0$, then the maximum of $x^{\top} A x$ under 
the constraints $x^{\top} B x = 1$ is given by the largest eigenvalue of $B^{-1} A$. More generally, 

$$
\max_{\{x : x^{\top} B x = 1\}} x^{\top} A x = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p = \min_{\{x : x^{\top} B x = 1\}} x^{\top} A x,
$$

where $\lambda_1, \ldots, \lambda_p$ denote the eigenvalues of $B^{-1} A$. The vector which maximizes (minimizes) 
$x^{\top} A x$ under the constraint $x^{\top} B x = 1$ is the eigenvector of $B^{-1} A$ which corresponds to the 
largest (smallest) eigenvalue of $B^{-1} A$. 


Proof: 
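By definition, $B^{1/2} = \Gamma_B \Lambda_B^{1/2} \Gamma_B^{\top}$. Set $y = B^{1/2} x$, then 

$$
\max_{\{x : x^{\top} B x = 1\}} x^{\top} A x = \max_{\{y : y^{\top} y = 1\}} y^{\top} B^{-1/2} A B^{-1/2} y. \tag{2.22}
$$

From Theorem 2.1, let 

$$
B^{-1/2} A B^{-1/2} = \Gamma \Lambda \Gamma^{\top}
$$

be the spectral decomposition of $B^{-1/2} A B^{-1/2}$. Set 

$$
z = \Gamma^{\top} y \;\Rightarrow\; z^{\top} z = y^{\top} \Gamma \Gamma^{\top} y = y^{\top} y.
$$

Thus (2.22) is equivalent to 

$$
\max_{\{z : z^{\top} z = 1\}} z^{\top} \Lambda z = \max_{\{z : z^{\top} z = 1\}} \sum_{i=1}^{p} \lambda_i z_i^2.
$$

But 

$$
\max_{z} \sum_{i} \lambda_i z_i^2 \le \lambda_1 \max_{z} \sum_{i} z_i^2 = \lambda_1.
$$

The maximum is thus obtained by $z = (1, 0, \ldots, 0)^{\top}$, i.e., 

$$
y = \gamma_1 \;\Rightarrow\; x = B^{-1/2} \gamma_1.
$$

Since $B^{-1} A$ and $B^{-1/2} A B^{-1/2}$ have the same eigenvalues, the proof is complete. □ 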


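EXAMPLE 2.7 Consider the following matrices 

$$
A = \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix} \quad \text{and} \quad B = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.
$$

We calculate 

$$
B^{-1} A = \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}.
$$

The biggest eigenvalue of the matrix $B^{-1} A$ is $2 + \sqrt{5}$. This means that the maximum of 
$x^{\top} A x$ under the constraint $x^{\top} B x = 1$ is $2 + \sqrt{5}$. 


Notice that the constraint $x^{\top} B x = 1$ corresponds, with our choice of $B$, to the points which 
lie on the unit circle $x_1^2 + x_2^2 = 1$. 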
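Theorem 2.5 can also be illustrated with a matrix $B$ that is not the identity. The sketch below assumes NumPy; the diagonal matrix `B` is an illustrative choice that does not appear in the text, and the grid search over the constraint set is only a crude cross-check of the eigenvalue result.

```python
import numpy as np

A = np.array([[1.0, 2.0], [2.0, 3.0]])
B = np.array([[2.0, 0.0], [0.0, 1.0]])     # symmetric, positive definite (assumed for illustration)

# Eigenvalues of B^{-1} A are real because A is symmetric and B > 0.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(B, A))
order = np.argsort(eigvals.real)[::-1]
lam_max = eigvals.real[order[0]]

# Rescale the leading eigenvector so that it satisfies x^T B x = 1.
x_star = eigvecs[:, order[0]].real
x_star = x_star / np.sqrt(x_star @ B @ x_star)

# Brute-force comparison over the constraint set {x : x^T B x = 1}.
theta = np.linspace(0.0, 2.0 * np.pi, 20001)
xs = np.stack([np.cos(theta) / np.sqrt(2.0), np.sin(theta)])   # every column has x^T B x = 1
grid_max = np.max(np.einsum("ik,ij,jk->k", xs, A, xs))
```

The value of the quadratic form at `x_star` equals `lam_max`, and no point of the grid exceeds it.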


Summary 


↪ A quadratic form can be described by a symmetric matrix $A$. 

↪ Quadratic forms can always be diagonalized. 

↪ Positive definiteness of a quadratic form is equivalent to positiveness of 
the eigenvalues of the matrix $A$. 

↪ The maximum and minimum of a quadratic form given some constraints 
can be expressed in terms of eigenvalues. 


2.4 Derivatives 


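For later sections of this book, it will be useful to introduce matrix notation for derivatives 
of a scalar function of a vector $x$ with respect to $x$. Consider $f : \mathbb{R}^p \to \mathbb{R}$ and a $(p \times 1)$ vector 
$x$, then $\frac{\partial f(x)}{\partial x}$ is the column vector of partial derivatives $\left\{ \frac{\partial f(x)}{\partial x_j} \right\}$, $j = 1, \ldots, p$, and $\frac{\partial f(x)}{\partial x^{\top}}$ is the 
row vector of the same derivative ($\frac{\partial f(x)}{\partial x}$ is called the gradient of $f$). 


We can also introduce second order derivatives: $\frac{\partial^2 f(x)}{\partial x \partial x^{\top}}$ is the $(p \times p)$ matrix of elements 
$\frac{\partial^2 f(x)}{\partial x_i \partial x_j}$, $i = 1, \ldots, p$ and $j = 1, \ldots, p$ ($\frac{\partial^2 f(x)}{\partial x \partial x^{\top}}$ is called the Hessian of $f$). 


Suppose that $a$ is a $(p \times 1)$ vector and that $A = A^{\top}$ is a $(p \times p)$ matrix. Then 

$$
\frac{\partial a^{\top} x}{\partial x} = \frac{\partial x^{\top} a}{\partial x} = a, \tag{2.23}
$$

$$
\frac{\partial x^{\top} A x}{\partial x} = 2 A x. \tag{2.24}
$$

The Hessian of the quadratic form $Q(x) = x^{\top} A x$ is: 

$$
\frac{\partial^2 x^{\top} A x}{\partial x \partial x^{\top}} = 2 A. \tag{2.25}
$$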
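The derivative rules (2.23) and (2.24) can be compared with finite-difference approximations. The sketch assumes NumPy; the helper `numerical_gradient`, the step size `h` and the evaluation point `x0` are illustrative assumptions.

```python
import numpy as np


def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        grad[j] = (f(x + step) - f(x - step)) / (2.0 * h)
    return grad


A = np.array([[1.0, 2.0], [2.0, 3.0]])    # symmetric
a = np.array([0.5, -1.0])
x0 = np.array([1.0, 2.0])                 # arbitrary evaluation point

grad_linear = numerical_gradient(lambda x: a @ x, x0)          # compare with a, rule (2.23)
grad_quadratic = numerical_gradient(lambda x: x @ A @ x, x0)   # compare with 2 A x0, rule (2.24)
```

For linear and quadratic functions the central difference has no truncation error, so the agreement is limited only by floating point rounding.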


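EXAMPLE 2.8 Consider the matrix 

$$
A = \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}.
$$

From formulas (2.24) and (2.25) it immediately follows that the gradient of $Q(x) = x^{\top} A x$ 
is 

$$
\frac{\partial x^{\top} A x}{\partial x} = 2 A x = 2 \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix} x = \begin{pmatrix} 2 x_1 + 4 x_2 \\ 4 x_1 + 6 x_2 \end{pmatrix}
$$

and the Hessian is 

$$
\frac{\partial^2 x^{\top} A x}{\partial x \partial x^{\top}} = 2 A = 2 \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix} = \begin{pmatrix} 2 & 4 \\ 4 & 6 \end{pmatrix}.
$$


2.5 Partitioned Matrices 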


Very often we will have to consider certain groups of rows and columns of a matrix A(n x p). 
In the case of two groups, we have 
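$$
A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},
$$

where $A_{ij}(n_i \times p_j)$, $i, j = 1, 2$, $n_1 + n_2 = n$ and $p_1 + p_2 = p$. 


If $B(n \times p)$ is partitioned accordingly, we have: 

$$
\begin{aligned}
A + B &= \begin{pmatrix} A_{11} + B_{11} & A_{12} + B_{12} \\ A_{21} + B_{21} & A_{22} + B_{22} \end{pmatrix} \\
B^{\top} &= \begin{pmatrix} B_{11}^{\top} & B_{21}^{\top} \\ B_{12}^{\top} & B_{22}^{\top} \end{pmatrix} \\
A B^{\top} &= \begin{pmatrix} A_{11} B_{11}^{\top} + A_{12} B_{12}^{\top} & A_{11} B_{21}^{\top} + A_{12} B_{22}^{\top} \\ A_{21} B_{11}^{\top} + A_{22} B_{12}^{\top} & A_{21} B_{21}^{\top} + A_{22} B_{22}^{\top} \end{pmatrix}
\end{aligned}
$$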


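An important particular case is the square matrix $A(p \times p)$, partitioned such that $A_{11}$ and 
$A_{22}$ are both square matrices (i.e., $n_j = p_j$, $j = 1, 2$). It can be verified that when $A$ is 
non-singular ($A A^{-1} = I_p$): 

$$
A^{-1} = \begin{pmatrix} A^{11} & A^{12} \\ A^{21} & A^{22} \end{pmatrix} \tag{2.26}
$$

where 

$$
\begin{aligned}
A^{11} &= (A_{11} - A_{12} A_{22}^{-1} A_{21})^{-1} \overset{\mathrm{def}}{=} (A_{11 \cdot 2})^{-1} \\
A^{12} &= -(A_{11 \cdot 2})^{-1} A_{12} A_{22}^{-1} \\
A^{21} &= -A_{22}^{-1} A_{21} (A_{11 \cdot 2})^{-1} \\
A^{22} &= A_{22}^{-1} + A_{22}^{-1} A_{21} (A_{11 \cdot 2})^{-1} A_{12} A_{22}^{-1}.
\end{aligned}
$$

An alternative expression can be obtained by reversing the positions of $A_{11}$ and $A_{22}$ in the 
original matrix. 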


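The following results will be useful if $A_{11}$ is non-singular: 

$$
|A| = |A_{11}|\,|A_{22} - A_{21} A_{11}^{-1} A_{12}| = |A_{11}|\,|A_{22 \cdot 1}|. \tag{2.27}
$$

If $A_{22}$ is non-singular, we have that: 

$$
|A| = |A_{22}|\,|A_{11} - A_{12} A_{22}^{-1} A_{21}| = |A_{22}|\,|A_{11 \cdot 2}|. \tag{2.28}
$$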


A useful formula is derived from the alternative expressions for the inverse and the 
determinant. For instance let 

$$
B = \begin{pmatrix} 1 & b^{\top} \\ a & A \end{pmatrix}
$$

where $a$ and $b$ are $(p \times 1)$ vectors and $A$ is non-singular. We then have: 

$$
|B| = |A - a b^{\top}| = |A|\,|1 - b^{\top} A^{-1} a| \tag{2.29}
$$

and equating the two expressions for $B^{22}$, we obtain the following: 

$$
(A - a b^{\top})^{-1} = A^{-1} + \frac{A^{-1} a b^{\top} A^{-1}}{1 - b^{\top} A^{-1} a}. \tag{2.30}
$$


EXAMPLE 2.9 Let’s consider the matrix 

$$
A = \begin{pmatrix} 1 & 2 \\ 2 & 2 \end{pmatrix}.
$$

We can use formula (2.26) to calculate the inverse of a partitioned matrix, i.e., $A^{11} = 
-1$, $A^{12} = A^{21} = 1$, $A^{22} = -1/2$. The inverse of $A$ is 

$$
A^{-1} = \begin{pmatrix} -1 & 1 \\ 1 & -0.5 \end{pmatrix}.
$$

It is also easy to calculate the determinant of $A$: 

$$
|A| = |1|\,|2 - 4| = -2.
$$
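The block formulas (2.26) and (2.28) can be verified for a larger matrix. The sketch below assumes NumPy; the randomly generated positive definite matrix and the $2 + 3$ partition are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 5))
A = M @ M.T + 5.0 * np.eye(5)          # symmetric positive definite, hence non-singular
A11, A12 = A[:2, :2], A[:2, 2:]
A21, A22 = A[2:, :2], A[2:, 2:]

A22_inv = np.linalg.inv(A22)
A11_2 = A11 - A12 @ A22_inv @ A21      # the block written A_{11.2} in (2.26)
A11_2_inv = np.linalg.inv(A11_2)

upper_left = A11_2_inv
upper_right = -A11_2_inv @ A12 @ A22_inv
lower_left = -A22_inv @ A21 @ A11_2_inv
lower_right = A22_inv + A22_inv @ A21 @ A11_2_inv @ A12 @ A22_inv

A_inv_blocks = np.block([[upper_left, upper_right], [lower_left, lower_right]])
det_blocks = np.linalg.det(A22) * np.linalg.det(A11_2)     # formula (2.28)
```

Comparing `A_inv_blocks` with `np.linalg.inv(A)` and `det_blocks` with `np.linalg.det(A)` gives agreement up to rounding error.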
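Let $A(n \times p)$ and $B(p \times n)$ be any two matrices and suppose that $n \ge p$. From (2.27) 
and (2.28) we can conclude that 

$$
\begin{vmatrix} -\lambda I_n & -A \\ B & I_p \end{vmatrix} = (-\lambda)^{n-p}\,|BA - \lambda I_p| = |AB - \lambda I_n|. \tag{2.31}
$$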


Since both determinants on the right-hand side of (2.31) are polynomials in $\lambda$, we find that 
the $n$ eigenvalues of $AB$ yield the $p$ eigenvalues of $BA$ plus the eigenvalue 0, $n - p$ times. 
The relationship between the eigenvectors is described in the next theorem. 


THEOREM 2.6 For $A(n \times p)$ and $B(p \times n)$, the non-zero eigenvalues of $AB$ and $BA$ are 
the same and have the same multiplicity. If $x$ is an eigenvector of $AB$ for an eigenvalue 
$\lambda \neq 0$, then $y = Bx$ is an eigenvector of $BA$. 


COROLLARY 2.2 For $A(n \times p)$, $B(q \times n)$, $a(p \times 1)$, and $b(q \times 1)$ we have 

$$
\operatorname{rank}(A a b^{\top} B) \le 1.
$$

The non-zero eigenvalue, if it exists, equals $b^{\top} B A a$ (with eigenvector $A a$). 


Proof: 
Theorem 2.6 asserts that the eigenvalues of $A a b^{\top} B$ are the same as those of $b^{\top} B A a$. Note 
that the matrix $b^{\top} B A a$ is a scalar and hence it is its own eigenvalue $\lambda_1$. 


Applying $A a b^{\top} B$ to $A a$ yields 

$$
(A a b^{\top} B)(A a) = (A a)(b^{\top} B A a) = \lambda_1 A a.
$$

Figure 2.1. Distance d. 


2.6 Geometrical Aspects 


Distance 


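Let $x, y \in \mathbb{R}^p$. A distance $d$ is defined as a function $d : \mathbb{R}^{2p} \to \mathbb{R}_{+}$ which fulfills 

$$
\begin{cases}
d(x, y) > 0 & \forall x \neq y \\
d(x, y) = 0 & \text{if and only if } x = y \\
d(x, y) \le d(x, z) + d(z, y) & \forall x, y, z.
\end{cases}
$$


A Euclidean distance $d$ between two points $x$ and $y$ is defined as 

$$
d^2(x, y) = (x - y)^{\top} A (x - y) \tag{2.32}
$$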


where A is a positive definite matrix (A > 0). A is called a metric. 


EXAMPLE 2.10 A particular case is when $A = I_p$, i.e., 

$$
d^2(x, y) = \sum_{i=1}^{p} (x_i - y_i)^2. \tag{2.33}
$$

Figure 2.1 illustrates this definition for $p = 2$. 


Note that the sets $E_d = \{x \in \mathbb{R}^p \mid (x - x_0)^{\top} (x - x_0) = d^2\}$, i.e., the spheres with radius $d$ 
and center $x_0$, are the Euclidean $I_p$ iso-distance curves from the point $x_0$ (see Figure 2.2). 


The more general distance (2.32) with a positive definite matrix $A$ ($A > 0$) leads to the 
iso-distance curves 

$$
E_d = \{x \in \mathbb{R}^p \mid (x - x_0)^{\top} A (x - x_0) = d^2\}, \tag{2.34}
$$


i.e., ellipsoids with center $x_0$, matrix $A$ and constant $d$ (see Figure 2.3). 


Figure 2.2. Iso-distance sphere. 


Figure 2.3. Iso-distance ellipsoid. 


Let $\gamma_1, \gamma_2, \ldots, \gamma_p$ be the orthonormal eigenvectors of $A$ corresponding to the eigenvalues $\lambda_1 \ge 
\lambda_2 \ge \ldots \ge \lambda_p$. The resulting observations are given in the next theorem. 


THEOREM 2.7 (i) The principal axes of $E_d$ are in the direction of $\gamma_i$; $i = 1, \ldots, p$. 
(ii) The half-lengths of the axes are $\sqrt{d^2 / \lambda_i}$; $i = 1, \ldots, p$. 
(iii) The rectangle surrounding the ellipsoid $E_d$ is defined by the following inequalities: 

$$
x_{0i} - \sqrt{d^2 a^{ii}} \le x_i \le x_{0i} + \sqrt{d^2 a^{ii}}, \quad i = 1, \ldots, p,
$$

where $a^{ii}$ is the $(i, i)$ element of $A^{-1}$. By the rectangle surrounding the ellipsoid $E_d$ we 
mean the rectangle whose sides are parallel to the coordinate axis. 
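The three statements of Theorem 2.7 can be turned into a short computation for $p = 2$. The sketch assumes NumPy; the metric `A`, the centre `x0` and the constant `d` are illustrative choices that do not come from the text.

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])     # positive definite metric (assumed for illustration)
x0 = np.array([1.0, -1.0])                 # centre of the ellipse
d = 2.0

lam, Gamma = np.linalg.eigh(A)             # columns of Gamma give the axis directions, (i)
half_lengths = np.sqrt(d ** 2 / lam)       # statement (ii), in the order of lam (ascending)

A_inv = np.linalg.inv(A)
half_widths = np.sqrt(d ** 2 * np.diag(A_inv))      # statement (iii)
lower, upper = x0 - half_widths, x0 + half_widths

# End points of the principal axes must lie on the ellipse (x - x0)^T A (x - x0) = d^2.
endpoints = x0[:, None] + Gamma * half_lengths      # column i is x0 + half_length_i * gamma_i
centred = endpoints - x0[:, None]
on_ellipse = np.einsum("ik,ij,jk->k", centred, A, centred)
```

Since `eigh` sorts the eigenvalues in ascending order, the first entry of `half_lengths` belongs to the longest axis.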


It is easy to find the coordinates of the tangency points between the ellipsoid and its sur- 
rounding rectangle parallel to the coordinate axes. Let us find the coordinates of the tangency 
point that are in the direction of the j-th coordinate axis (positive direction). 


For ease of notation, we suppose the ellipsoid is centered around the origin ($x_0 = 0$). If not, 
the rectangle will be shifted by the value of $x_0$. 


The coordinate of the tangency point is given by the solution to the following problem: 

$$
x = \arg\max_{x^{\top} A x = d^2} e_j^{\top} x \tag{2.35}
$$

where $e_j$ is the $j$-th column of the identity matrix $I_p$. The coordinate of the tangency point 
in the negative direction would correspond to the solution of the min problem: by symmetry, 
it is the opposite value of the former. 


The solution is computed via the Lagrangian $L = e_j^{\top} x - \lambda (x^{\top} A x - d^2)$ which by (2.23) and (2.24) leads 
to the following system of equations: 

$$
\frac{\partial L}{\partial x} = e_j - 2 \lambda A x = 0 \tag{2.36}
$$

$$
\frac{\partial L}{\partial \lambda} = -(x^{\top} A x - d^2) = 0. \tag{2.37}
$$

This gives $x = \frac{1}{2\lambda} A^{-1} e_j$, or componentwise 

$$
x_i = \frac{1}{2\lambda} a^{ij}, \quad i = 1, \ldots, p \tag{2.38}
$$

where $a^{ij}$ denotes the $(i, j)$-th element of $A^{-1}$. 

Premultiplying (2.36) by $x^{\top}$, we have from (2.37): 

$$
x_j = 2 \lambda d^2.
$$

Comparing this to the value obtained by (2.38), for $i = j$ we obtain $2\lambda = \sqrt{a^{jj} / d^2}$. We choose 
the positive value of the square root because we are maximizing $e_j^{\top} x$. A minimum would 
correspond to the negative value. Finally, we have the coordinates of the tangency point 
between the ellipsoid and its surrounding rectangle in the positive direction of the $j$-th axis: 

$$
x_i = \sqrt{\frac{d^2}{a^{jj}}}\, a^{ij}, \quad i = 1, \ldots, p. \tag{2.39}
$$

The particular case where $i = j$ provides statement (iii) in Theorem 2.7. 
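Formula (2.39) can be cross-checked by scanning the ellipse for the point with the largest $j$-th coordinate. The sketch assumes NumPy, $p = 2$ and an ellipse centred at the origin; the metric `A`, the constant `d` and the function name `tangency_point` are illustrative assumptions.

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])     # positive definite (assumed for illustration)
d = 2.0
A_inv = np.linalg.inv(A)


def tangency_point(j):
    """Tangency point (2.39) in the positive direction of axis j (0-based), centre at 0."""
    return np.sqrt(d ** 2 / A_inv[j, j]) * A_inv[:, j]


t0 = tangency_point(0)

# Independent check: parametrize the ellipse and pick the point with the largest first coordinate.
lam, Gamma = np.linalg.eigh(A)
theta = np.linspace(0.0, 2.0 * np.pi, 200001)
circle = np.stack([np.cos(theta), np.sin(theta)])
ellipse = Gamma @ (circle * (d / np.sqrt(lam))[:, None])    # points with x^T A x = d^2
best = ellipse[:, np.argmax(ellipse[0])]
```

The first coordinate of `t0` equals $\sqrt{d^2 a^{11}}$, the half-width of the surrounding rectangle from statement (iii).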


Remark: usefulness of Theorem 2.7 


Theorem 2.7 will prove to be particularly useful in many subsequent chapters. First, it 
provides a helpful tool for graphing an ellipse in two dimensions. Indeed, knowing the slope 
of the principal axes of the ellipse, their half-lengths and drawing the rectangle inscribing 
the ellipse allows one to quickly draw a rough picture of the shape of the ellipse. 


In Chapter 7, it is shown that the confidence region for the vector $\mu$ of a multivariate 
normal population is given by a particular ellipsoid whose parameters depend on sample 
characteristics. The rectangle inscribing the ellipsoid (which is much easier to obtain) will 
provide the simultaneous confidence intervals for all of the components in $\mu$. 


In addition it will be shown that the contour surfaces of the multivariate normal density 
are provided by ellipsoids whose parameters depend on the mean vector and on the covari- 
ance matrix. We will see that the tangency points between the contour ellipsoids and the 
surrounding rectangle are determined by regressing one component on the (p — 1) other 
components. For instance, in the direction of the j-th axis, the tangency points are given 
by the intersections of the ellipsoid contours with the regression line of the vector of (p — 1) 
variables (all components except the j-th) on the j-th component. 


Norm of a Vector 


Consider a vector $x \in \mathbb{R}^p$. The norm or length of $x$ (with respect to the metric $I_p$) is defined 
as 

$$
\|x\| = d(0, x) = \sqrt{x^{\top} x}.
$$

If $\|x\| = 1$, $x$ is called a unit vector. A more general norm can be defined with respect to the 
metric $A$: 

$$
\|x\|_A = \sqrt{x^{\top} A x}.
$$


Figure 2.4. Angle between vectors. 


Angle between two Vectors 


Consider two vectors $x$ and $y \in \mathbb{R}^p$. The angle $\theta$ between $x$ and $y$ is defined by the cosine of 
$\theta$: 

$$
\cos \theta = \frac{x^{\top} y}{\|x\|\,\|y\|}, \tag{2.40}
$$

see Figure 2.4. Indeed for $p = 2$, $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ and $y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}$, we have 

$$
\begin{aligned}
\|x\| \cos \theta_1 &= x_1; & \|y\| \cos \theta_2 &= y_1 \\
\|x\| \sin \theta_1 &= x_2; & \|y\| \sin \theta_2 &= y_2,
\end{aligned} \tag{2.41}
$$

therefore, 

$$
\cos \theta = \cos \theta_1 \cos \theta_2 + \sin \theta_1 \sin \theta_2 = \frac{x_1 y_1 + x_2 y_2}{\|x\|\,\|y\|} = \frac{x^{\top} y}{\|x\|\,\|y\|}.
$$


REMARK 2.1 If $x^{\top} y = 0$, then the angle $\theta$ is equal to $\frac{\pi}{2}$. From trigonometry, we know that 
the cosine of $\theta$ equals the length of the base of a triangle ($\|p_x\|$) divided by the length of the 
hypotenuse ($\|x\|$). Hence, we have 

$$
\|p_x\| = \|x\|\,|\cos \theta| = \frac{|x^{\top} y|}{\|y\|}, \tag{2.42}
$$

where $p_x$ is the projection of $x$ on $y$ (which is defined below). It is the coordinate of $x$ on the 
$y$ vector, see Figure 2.5. 


Figure 2.5. Projection. 


The angle can also be defined with respect to a general metric $A$ 

$$
\cos \theta = \frac{x^{\top} A y}{\|x\|_A\, \|y\|_A}. \tag{2.43}
$$

If $\cos \theta = 0$ then $x$ is orthogonal to $y$ with respect to the metric $A$. 


EXAMPLE 2.11 Assume that there are two centered (i.e., zero mean) data vectors. The 
cosine of the angle between them is equal to their correlation (defined in (3.8))! Indeed for 
$x$ and $y$ with $\bar{x} = \bar{y} = 0$ we have 

$$
r_{XY} = \frac{\sum x_i y_i}{\sqrt{\sum x_i^2 \sum y_i^2}} = \cos \theta
$$

according to formula (2.40). 
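The identity between the cosine and the correlation is easy to confirm on a small data set. The sketch assumes NumPy; the two data vectors are made up for illustration, and `np.corrcoef` is used as an independent implementation of the empirical correlation.

```python
import numpy as np

x_raw = np.array([2.0, 4.0, 6.0, 9.0, 4.0])     # made-up data for illustration
y_raw = np.array([1.0, 3.0, 2.0, 7.0, 2.0])

x = x_raw - x_raw.mean()          # centring gives zero-mean data vectors
y = y_raw - y_raw.mean()

cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))   # formula (2.40)
r_xy = np.corrcoef(x_raw, y_raw)[0, 1]                          # empirical correlation
```

Without the centring step the two quantities generally differ, which is why the example insists on zero-mean vectors.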


Rotations 


When we consider a point $x \in \mathbb{R}^p$, we generally use a $p$-coordinate system to obtain its 
geometric representation, like in Figure 2.1 for instance. There will be situations in multivariate 
techniques where we will want to rotate this system of coordinates by the angle $\theta$. 


Consider for example the point $P$ with coordinates $x = (x_1, x_2)^{\top}$ in $\mathbb{R}^2$ with respect to a 
given set of orthogonal axes. Let $\Gamma$ be a $(2 \times 2)$ orthogonal matrix where 

$$
\Gamma = \begin{pmatrix} \cos \theta & \sin \theta \\ -\sin \theta & \cos \theta \end{pmatrix}. \tag{2.44}
$$

If the axes are rotated about the origin through an angle $\theta$ in a counterclockwise direction, the new 
coordinates of $P$ will be given by the vector $y$ 

$$
y = \Gamma x, \tag{2.45}
$$

and a rotation through the same angle in a clockwise direction gives the new 
coordinates as 

$$
y = \Gamma^{\top} x. \tag{2.46}
$$

More generally, premultiplying a vector $x$ by an orthogonal matrix $\Gamma$ geometrically 
corresponds to a rotation of the system of axes, so that the first new axis is determined by the 
first row of $\Gamma$. This geometric point of view will be exploited in Chapters 9 and 10. 
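The statement that the first new axis is determined by the first row of $\Gamma$ can be checked with a few lines of plain Python. The helper names `rotation_matrix` and `apply`, the angle and the test point are illustrative assumptions.

```python
import math


def rotation_matrix(theta):
    """The orthogonal matrix Gamma of (2.44) as a list of rows."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]


def apply(matrix, x):
    """Matrix-vector product for plain Python lists."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in matrix]


theta = math.pi / 6
Gamma = rotation_matrix(theta)
Gamma_T = [list(col) for col in zip(*Gamma)]

# A point at distance 2 from the origin in the direction of the first row of Gamma ...
x = [2 * math.cos(theta), 2 * math.sin(theta)]
y = apply(Gamma, x)               # ... has the new coordinates (2, 0)
x_back = apply(Gamma_T, y)        # Gamma^T undoes the change of coordinates
```

Because $\Gamma$ is orthogonal, the change of coordinates leaves the length of the vector unchanged.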


Column Space and Null Space of a Matrix 


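Define for $X(n \times p)$ 

$$
\operatorname{Im}(X) \overset{\mathrm{def}}{=} C(X) = \{x \in \mathbb{R}^n \mid \exists a \in \mathbb{R}^p \text{ so that } X a = x\},
$$

the space generated by the columns of $X$ or the column space of $X$. Note that $C(X) \subseteq \mathbb{R}^n$ 
and $\dim\{C(X)\} = \operatorname{rank}(X) = r \le \min(n, p)$. 

$$
\operatorname{Ker}(X) \overset{\mathrm{def}}{=} N(X) = \{y \in \mathbb{R}^p \mid X y = 0\}
$$

is the null space of $X$. Note that $N(X) \subseteq \mathbb{R}^p$ and that $\dim\{N(X)\} = p - r$. 


REMARK 2.2 $N(X^{\top})$ is the orthogonal complement of $C(X)$ in $\mathbb{R}^n$, i.e., given a vector 
$b \in \mathbb{R}^n$ it will hold that $x^{\top} b = 0$ for all $x \in C(X)$, if and only if $b \in N(X^{\top})$. 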


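EXAMPLE 2.12 Let 

$$
X = \begin{pmatrix} 2 & 3 & 5 \\ 4 & 6 & 7 \\ 6 & 8 & 6 \\ 8 & 2 & 4 \end{pmatrix}.
$$

It is easy to show (e.g. by calculating the determinant of the submatrix formed by the first 
three rows of $X$) that $\operatorname{rank}(X) = 3$. Hence, the column space $C(X)$ is a three-dimensional 
subspace of $\mathbb{R}^4$. 


The null space of $X$ contains only the zero vector $(0, 0, 0)^{\top}$ and its dimension is equal to 
$3 - \operatorname{rank}(X) = 0$. 


For 

$$
X = \begin{pmatrix} 2 & 3 & 1 \\ 4 & 6 & 2 \\ 6 & 8 & 3 \\ 8 & 2 & 4 \end{pmatrix},
$$

the third column is a multiple of the first one and the matrix $X$ 
cannot be of full rank. Noticing that the first two columns of $X$ are independent, we see that 
$\operatorname{rank}(X) = 2$. In this case, the dimension of the column space is 2 and the dimension of the 
null space is 1. 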
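Rank and null space can be read off the singular value decomposition. The sketch below assumes NumPy and uses two $4 \times 3$ matrices chosen for illustration (assumed here: one of full column rank, and one whose third column is half of the first); the function name `rank_and_null_space` and the tolerance are likewise assumptions.

```python
import numpy as np

X_full = np.array([[2.0, 3.0, 5.0],
                   [4.0, 6.0, 7.0],
                   [6.0, 8.0, 6.0],
                   [8.0, 2.0, 4.0]])
X_deficient = np.array([[2.0, 3.0, 1.0],
                        [4.0, 6.0, 2.0],
                        [6.0, 8.0, 3.0],
                        [8.0, 2.0, 4.0]])     # third column = 0.5 * first column


def rank_and_null_space(X, tol=1e-10):
    """Return rank(X) and an orthonormal basis of N(X), both taken from the SVD."""
    _, s, Vt = np.linalg.svd(X)
    r = int(np.sum(s > tol))
    return r, Vt[r:].T            # rows of Vt beyond the rank span the null space


r_full, N_full = rank_and_null_space(X_full)
r_def, N_def = rank_and_null_space(X_deficient)
```

For the rank-deficient matrix the single basis vector of the null space is proportional to $(1, 0, -2)^{\top}$, which expresses that the first column equals twice the third.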


Projection Matrix 


A matrix $P(n \times n)$ is called an (orthogonal) projection matrix in $\mathbb{R}^n$ if and only if $P = P^{\top} = 
P^2$ ($P$ is idempotent). Let $b \in \mathbb{R}^n$. Then $a = P b$ is the projection of $b$ on $C(P)$. 


Projection on $C(X)$ 


Consider $X(n \times p)$ and let 

$$
P = X (X^{\top} X)^{-1} X^{\top} \tag{2.47}
$$

and $Q = I_n - P$. It’s easy to check that $P$ and $Q$ are idempotent and that 

$$
P X = X \quad \text{and} \quad Q X = 0. \tag{2.48}
$$

Since the columns of $X$ are projected onto themselves, the projection matrix $P$ projects any 
vector $b \in \mathbb{R}^n$ onto $C(X)$. Similarly, the projection matrix $Q$ projects any vector $b \in \mathbb{R}^n$ 
onto the orthogonal complement of $C(X)$. 


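THEOREM 2.8 Let $P$ be the projection (2.47) and $Q$ its orthogonal complement. Then: 


(i) $x = P b \Rightarrow x \in C(X)$, 
(ii) $y = Q b \Rightarrow y^{\top} x = 0 \;\; \forall x \in C(X)$. 


Proof: 
(i) holds, since $x = X (X^{\top} X)^{-1} X^{\top} b = X a$, where $a = (X^{\top} X)^{-1} X^{\top} b \in \mathbb{R}^p$. 
(ii) follows from $y = b - P b$ and $x = X a \Rightarrow y^{\top} x = b^{\top} X a - b^{\top} X (X^{\top} X)^{-1} X^{\top} X a = 0$. □ 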
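The properties of the projection matrices $P$ and $Q$ can be verified numerically. The sketch assumes NumPy; the random matrix `X` (which has full column rank with probability one) and the vectors `b`, `u`, `v` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 2))                   # n = 6, p = 2
b = rng.normal(size=6)

P = X @ np.linalg.inv(X.T @ X) @ X.T          # projection onto C(X), formula (2.47)
Q = np.eye(6) - P                             # projection onto the orthogonal complement

x = P @ b                                     # lies in C(X)
y = Q @ b                                     # orthogonal to every column of X

# Special case of a single column: projection of the vector u on the vector v.
u, v = rng.normal(size=6), rng.normal(size=6)
p_u = (v @ u) / (v @ v) * v
```

The decomposition `b = x + y` splits `b` into a part inside the column space and a part orthogonal to it, which is the geometric idea behind least squares.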


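REMARK 2.3 Let $x, y \in \mathbb{R}^n$ and consider $p_x \in \mathbb{R}^n$, the projection of $x$ on $y$ (see Figure 
2.5). With $X = y$ we have from (2.47) 

$$
p_x = y (y^{\top} y)^{-1} y^{\top} x = \frac{y^{\top} x}{\|y\|^2}\, y \tag{2.49}
$$

and we can easily verify that 

$$
\|p_x\| = \sqrt{p_x^{\top} p_x} = \frac{|y^{\top} x|}{\|y\|}.
$$

See again Remark 2.1. 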




Summary 


↪ A distance between two $p$-dimensional points $x$ and $y$ is a quadratic form 
$(x - y)^{\top} A (x - y)$ in the vectors of differences $(x - y)$. A distance defines 
the norm of a vector. 

↪ Iso-distance curves of a point $x_0$ are all those points that have the same 
distance from $x_0$. Iso-distance curves are ellipsoids whose principal axes 
are determined by the direction of the eigenvectors of $A$. The half-length of 
principal axes is proportional to the inverse of the roots of the eigenvalues 
of $A$. 

↪ The angle between two vectors $x$ and $y$ is given by $\cos \theta = \frac{x^{\top} A y}{\|x\|_A \|y\|_A}$ w.r.t. 
the metric $A$. 

↪ For the Euclidean distance with $A = I$ the correlation between two 
centered data vectors $x$ and $y$ is given by the cosine of the angle between 
them, i.e., $\cos \theta = r_{XY}$. 

↪ The projection $P = X (X^{\top} X)^{-1} X^{\top}$ is the projection onto the column 
space $C(X)$ of $X$. 

↪ The projection of $x \in \mathbb{R}^n$ on $y \in \mathbb{R}^n$ is given by $p_x = \frac{y^{\top} x}{\|y\|^2}\, y$. 


2.7 Exercises 


EXERCISE 2.1 Compute the determinant for a $(3 \times 3)$ matrix. 


EXERCISE 2.2 Suppose that |A| = 0. Is it possible that all eigenvalues of A are positive? 


EXERCISE 2.3 Suppose that all eigenvalues of some (square) matrix A are different from 
zero. Does the inverse A~' of A exist? 


EXERCISE 2.4 Write a program that calculates the Jordan decomposition of the matrix 


Check Theorem 2.1 numerically. 


EXERCISE 2.5 Prove (2.23), (2.24) and (2.25). 


EXERCISE 2.6 Show that a projection matrix only has eigenvalues in $\{0, 1\}$. 


EXERCISE 2.7 Draw some iso-distance ellipsoids for the metric $A = \Sigma^{-1}$ of Example 3.19. 


EXERCISE 2.8 Find a formula for $|A + a a^{\top}|$ and for $(A + a a^{\top})^{-1}$. (Hint: use the inverse 
partitioned matrix with $B = \begin{pmatrix} 1 & -a^{\top} \\ a & A \end{pmatrix}$.) 


EXERCISE 2.9 Prove the Binomial inverse theorem for two non-singular matrices $A(p \times p)$ 
and $B(p \times p)$: $(A + B)^{-1} = A^{-1} - A^{-1} (A^{-1} + B^{-1})^{-1} A^{-1}$. (Hint: use (2.26) with $C = 
\begin{pmatrix} A & I_p \\ -I_p & B^{-1} \end{pmatrix}$.) 


3 Moving to Higher Dimensions 


We have seen in the previous chapters how very simple graphical devices can help in under- 
standing the structure and dependency of data. The graphical tools were based on either 
univariate (bivariate) data representations or on “slick” transformations of multivariate infor- 
mation perceivable by the human eye. Most of the tools are extremely useful in a modelling 
step, but unfortunately, do not give the full picture of the data set. One reason for this is 
that the graphical tools presented capture only certain dimensions of the data and do not 
necessarily concentrate on those dimensions or subparts of the data under analysis that carry 
the maximum structural information. In Part II of this book, powerful tools for reducing 
the dimension of a data set will be presented. In this chapter, as a starting point, simple and 
basic tools are used to describe dependency. They are constructed from elementary facts of 
probability theory and introductory statistics (for example, the covariance and correlation 
between two variables). 


Sections 3.1 and 3.2 show how to handle these concepts in a multivariate setup and how a 
simple test on correlation between two variables can be derived. Since linear relationships 
are involved in these measures, Section 3.4 presents the simple linear model for two variables 
and recalls the basic t-test for the slope. In Section 3.5, a simple example of one-factorial 
analysis of variance introduces the notations for the well known F-test. 


Due to the power of matrix notation, all of this can easily be extended to a more general 
multivariate setup. Section 3.3 shows how matrix operations can be used to define summary 
statistics of a data set and for obtaining the empirical moments of linear transformations of 
the data. These results will prove to be very useful in most of the chapters in Part II. 


Finally, matrix notation allows us to introduce the flexible multiple linear model, where more 
general relationships among variables can be analyzed. In Section 3.6, the least squares 
adjustment of the model and the usual test statistics are presented with their geometric 
interpretation. Using these notations, the ANOVA model is just a particular case of the 
multiple linear model.


3.1 Covariance


Covariance is a measure of dependency between random variables. Given two (random) 
variables X and Y the (theoretical) covariance is defined by: 


$$\sigma_{XY} = \operatorname{Cov}(X,Y) = E(XY) - (EX)(EY). \qquad (3.1)$$


The precise definition of expected values is given in Chapter 4. If X and Y are independent 
of each other, the covariance Cov(X, Y) is necessarily equal to zero, see Theorem 3.1. The 
converse is not true. The covariance of X with itself is the variance: 


$$\sigma_{XX} = \operatorname{Var}(X) = \operatorname{Cov}(X,X).$$


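If the variable $X$ is p-dimensional multivariate, e.g., $X = (X_1, \ldots, X_p)^\top$, then the theoretical
covariances among all the elements are put into matrix form, i.e., the covariance matrix:

$$\Sigma = \begin{pmatrix} \sigma_{X_1X_1} & \cdots & \sigma_{X_1X_p} \\ \vdots & \ddots & \vdots \\ \sigma_{X_pX_1} & \cdots & \sigma_{X_pX_p} \end{pmatrix}.$$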


Properties of covariance matrices will be detailed in Chapter 4. Empirical versions of these 
quantities are: 


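$$s_{XY} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \qquad (3.2)$$
$$s_{XX} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2. \qquad (3.3)$$


For small n, say n < 20, we should replace the factor $\frac{1}{n}$ in (3.2) and (3.3) by $\frac{1}{n-1}$ in order
to correct for a small bias. For a p-dimensional random variable, one obtains the empirical
covariance matrix (see Section 3.3 for properties and details)

$$S = \begin{pmatrix} s_{X_1X_1} & \cdots & s_{X_1X_p} \\ \vdots & \ddots & \vdots \\ s_{X_pX_1} & \cdots & s_{X_pX_p} \end{pmatrix}.$$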


For a scatterplot of two variables the covariances measure “how close the scatter is to a 
line”. Mathematical details follow but it should already be understood here that in this 
sense covariance measures only “linear dependence”.


EXAMPLE 3.1 If $\mathcal{X}$ is the entire bank data set, one obtains the covariance matrix $S$ as
indicated below:

$$S = \begin{pmatrix}
0.14 & 0.03 & 0.02 & -0.10 & -0.01 & 0.08 \\
0.03 & 0.12 & 0.10 & 0.21 & 0.10 & -0.21 \\
0.02 & 0.10 & 0.16 & 0.28 & 0.12 & -0.24 \\
-0.10 & 0.21 & 0.28 & 2.07 & 0.16 & -1.03 \\
-0.01 & 0.10 & 0.12 & 0.16 & 0.64 & -0.54 \\
0.08 & -0.21 & -0.24 & -1.03 & -0.54 & 1.32
\end{pmatrix}. \qquad (3.4)$$

The empirical covariance between $X_4$ and $X_5$, i.e., $s_{X_4X_5}$, is found in row 4 and column 5.
The value is $s_{X_4X_5} = 0.16$. Is it obvious that this value is positive? In Exercise 3.1 we will
discuss this question further.


If $X_f$ denotes the counterfeit bank notes, we obtain:

$$S_f = \begin{pmatrix}
0.123 & 0.031 & 0.023 & -0.099 & 0.019 & 0.011 \\
0.031 & 0.064 & 0.046 & -0.024 & -0.012 & -0.005 \\
0.024 & 0.046 & 0.088 & -0.018 & 0.000 & 0.034 \\
-0.099 & -0.024 & -0.018 & 1.268 & -0.485 & 0.236 \\
0.019 & -0.012 & 0.000 & -0.485 & 0.400 & -0.022 \\
0.011 & -0.005 & 0.034 & 0.236 & -0.022 & 0.308
\end{pmatrix}. \qquad (3.5)$$

For the genuine, $X_g$, we have:

$$S_g = \begin{pmatrix}
0.149 & 0.057 & 0.057 & 0.056 & 0.014 & 0.005 \\
0.057 & 0.131 & 0.085 & 0.056 & 0.048 & -0.043 \\
0.057 & 0.085 & 0.125 & 0.058 & 0.030 & -0.024 \\
0.056 & 0.056 & 0.058 & 0.409 & -0.261 & -0.000 \\
0.014 & 0.049 & 0.030 & -0.261 & 0.417 & -0.074 \\
0.005 & -0.043 & -0.024 & -0.000 & -0.074 & 0.198
\end{pmatrix}. \qquad (3.6)$$

Note that the covariance between $X_4$ (distance of the frame to the lower border) and $X_5$
(distance of the frame to the upper border) is negative in both (3.5) and (3.6)! Why would
this happen? In Exercise 3.2 we will discuss this question in more detail.


At first sight, the matrices $S_f$ and $S_g$ look different, but they create almost the same scatterplots
(see the discussion in Section 1.4). Similarly, the common principal component analysis
in Chapter 9 suggests a joint analysis of the covariance structure as in Flury and Riedwyl 
(1988). 


Scatterplots with point clouds that are “upward-sloping”, like the one in the upper left of 
Figure 1.14, show variables with positive covariance. Scatterplots with “downward-sloping” 
structure have negative covariance. In Figure 3.1 we show the scatterplot of X4 vs. X5 of 
the entire bank data set. The point cloud is upward-sloping. However, the two sub-clouds 
of counterfeit and genuine bank notes are downward-sloping.


Figure 3.1. Scatterplot of variables X4 vs. X5 of the entire bank data set. MVAscabank45.xpl


EXAMPLE 3.2 A textile shop manager is studying the sales of “classic blue” pullovers over
10 different periods. He observes the number of pullovers sold ($X_1$), variation in price ($X_2$,
in EUR), the advertisement costs in local newspapers ($X_3$, in EUR) and the presence of a
sales assistant ($X_4$, in hours per period). Over the periods, he observes the following data
matrix:

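$$\mathcal{X} = \begin{pmatrix}
230 & 125 & 200 & 109 \\
181 & 99 & 55 & 107 \\
165 & 97 & 105 & 98 \\
150 & 115 & 85 & 71 \\
97 & 120 & 0 & 82 \\
192 & 100 & 150 & 103 \\
181 & 80 & 85 & 111 \\
189 & 90 & 120 & 93 \\
172 & 95 & 110 & 86 \\
170 & 125 & 130 & 78
\end{pmatrix}$$


Figure 3.2. Scatterplot of variables X2 vs. X1 of the pullovers data set. MVAscapull1.xpl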


He is convinced that the price must have a large influence on the number of pullovers sold. 
So he makes a scatterplot of X2 vs. X1, see Figure 3.2. A rough impression is that the cloud 
is somewhat downward-sloping. A computation of the empirical covariance yields 


$$s_{X_1X_2} = \frac{1}{10}\sum_{i=1}^{10}(X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2) = -80.02,$$

a negative value as expected.
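
The computation can be reproduced in a few lines. This is a minimal sketch in plain Python (not the XploRe quantlet named in the figure caption); the lists are the sales and price columns of the data matrix above, and the function name `covariance` with its `unbiased` flag is an illustrative assumption.

```python
def covariance(x, y, unbiased=False):
    """Empirical covariance as in (3.2); unbiased=True uses 1/(n-1) instead of 1/n."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / (n - 1 if unbiased else n)

sales = [230, 181, 165, 150, 97, 192, 181, 189, 172, 170]   # X1
price = [125, 99, 97, 115, 120, 100, 80, 90, 95, 125]       # X2

s_n = covariance(sales, price)                 # factor 1/n
s_u = covariance(sales, price, unbiased=True)  # factor 1/(n-1)
print(round(s_n, 2), round(s_u, 2))
```

With the factor 1/n the result is -80.02; with the factor 1/(n-1) it is about -88.91, so for n = 10 the choice of factor changes the value noticeably.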

Note: The covariance function is scale dependent. Thus, if the prices in this example were
in Japanese Yen (JPY), we would obtain a different answer (see Exercise 3.16). A measure
of (linear) dependence independent of the scale is the correlation, which we introduce in the
next section.
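
The scale dependence is easy to see numerically. The sketch below rescales the price column by a conversion rate; the rate of 100 JPY per EUR is a purely hypothetical number chosen for illustration, and the helper names are assumptions.

```python
def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def corr(x, y):
    return cov(x, y) / (cov(x, x) * cov(y, y)) ** 0.5

sales = [230, 181, 165, 150, 97, 192, 181, 189, 172, 170]
price_eur = [125, 99, 97, 115, 120, 100, 80, 90, 95, 125]

RATE = 100.0  # hypothetical JPY per EUR, for illustration only
price_jpy = [RATE * p for p in price_eur]

cov_eur, cov_jpy = cov(sales, price_eur), cov(sales, price_jpy)
r_eur, r_jpy = corr(sales, price_eur), corr(sales, price_jpy)
```

The covariance is multiplied by the conversion rate, while the standardized measure stays the same.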
Summary


The covariance is a measure of dependence. 


Covariance measures only linear dependence. 


Covariance is scale dependent. 


There are nonlinear dependencies that have zero covariance. 


Zero covariance does not imply independence. 


Independence implies zero covariance. 


Negative covariance corresponds to downward-sloping scatterplots. 


Positive covariance corresponds to upward-sloping scatterplots. 




The covariance of a variable with itself is its variance $\operatorname{Cov}(X,X) = \sigma_{XX} = \sigma_X^2$.


For small n, we should replace the factor $\frac{1}{n}$ in the computation of the
covariance by $\frac{1}{n-1}$.


3.2 Correlation 


The correlation between two variables X and Y is defined from the covariance as follows:

$$\rho_{XY} = \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}. \qquad (3.7)$$


The advantage of the correlation is that it is independent of the scale, i.e., changing the 
variables’ scale of measurement does not change the value of the correlation. Therefore, the 
correlation is more useful as a measure of association between two random variables than 
the covariance. The empirical version of $\rho_{XY}$ is as follows:

$$r_{XY} = \frac{s_{XY}}{\sqrt{s_{XX}\, s_{YY}}}. \qquad (3.8)$$


The correlation is in absolute value always less than or equal to 1. It is zero if the covariance
is zero and vice-versa. For p-dimensional vectors $(X_1, \ldots, X_p)^\top$ we have the theoretical correlation
matrix

$$\mathcal{P} = \begin{pmatrix} \rho_{X_1X_1} & \cdots & \rho_{X_1X_p} \\ \vdots & \ddots & \vdots \\ \rho_{X_pX_1} & \cdots & \rho_{X_pX_p} \end{pmatrix},$$

and its empirical version, the empirical correlation matrix which can be calculated from the
observations,

$$R = \begin{pmatrix} r_{X_1X_1} & \cdots & r_{X_1X_p} \\ \vdots & \ddots & \vdots \\ r_{X_pX_1} & \cdots & r_{X_pX_p} \end{pmatrix}.$$


EXAMPLE 3.3 We obtain the following correlation matrix for the genuine bank notes:

$$R_g = \begin{pmatrix}
1.00 & 0.41 & 0.41 & 0.22 & 0.05 & 0.03 \\
0.41 & 1.00 & 0.66 & 0.24 & 0.20 & -0.25 \\
0.41 & 0.66 & 1.00 & 0.25 & 0.13 & -0.14 \\
0.22 & 0.24 & 0.25 & 1.00 & -0.63 & -0.00 \\
0.05 & 0.20 & 0.13 & -0.63 & 1.00 & -0.25 \\
0.03 & -0.25 & -0.14 & -0.00 & -0.25 & 1.00
\end{pmatrix}, \qquad (3.9)$$

and for the counterfeit bank notes:

$$R_f = \begin{pmatrix}
1.00 & 0.35 & 0.24 & -0.25 & 0.08 & 0.06 \\
0.35 & 1.00 & 0.61 & -0.08 & -0.07 & -0.03 \\
0.24 & 0.61 & 1.00 & -0.05 & 0.00 & 0.20 \\
-0.25 & -0.08 & -0.05 & 1.00 & -0.68 & 0.37 \\
0.08 & -0.07 & 0.00 & -0.68 & 1.00 & -0.06 \\
0.06 & -0.03 & 0.20 & 0.37 & -0.06 & 1.00
\end{pmatrix}. \qquad (3.10)$$

As noted before for $\operatorname{Cov}(X_4, X_5)$, the correlation between $X_4$ (distance of the frame to the
lower border) and $X_5$ (distance of the frame to the upper border) is negative. This is natural,
since the covariance and correlation always have the same sign (see also Exercise 3.17).


Why is the correlation an interesting statistic to study? It is related to independence of 
random variables, which we shall define more formally later on. For the moment we may 
think of independence as the fact that one variable has no influence on another. 


THEOREM 3.1 If X and Y are independent, then $\rho(X,Y) = \operatorname{Cov}(X,Y) = 0$.
In general, the converse is not true, as the following example shows.
EXAMPLE 3.4 Consider a standard normally-distributed random variable X and a random
variable $Y = X^2$, which is surely not independent of X. Here we have

$$\operatorname{Cov}(X,Y) = E(XY) - E(X)E(Y) = E(X^3) = 0$$

(because $E(X) = 0$ and $E(X^2) = 1$, and $E(X^3) = 0$ by the symmetry of the standard normal
distribution). Therefore $\rho(X,Y) = 0$, as well. This example
also shows that correlations and covariances measure only linear dependence. The quadratic
dependence of $Y = X^2$ on X is not reflected by these measures of dependence.


REMARK 3.1 For two normal random variables, the converse of Theorem 3.1 is true: zero
covariance for two normally-distributed random variables implies independence. This will be
shown later in Corollary 5.2.
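
The phenomenon of Example 3.4 can also be verified exactly on a discrete analogue. The sketch below is an assumption made for illustration: instead of a standard normal X it uses X uniform on the symmetric support {-2, -1, 0, 1, 2}, so that all moments are exact fractions.

```python
from fractions import Fraction

support = [-2, -1, 0, 1, 2]        # symmetric support (illustrative assumption)
p = Fraction(1, len(support))      # uniform probabilities

E_X = sum(p * x for x in support)
E_Y = sum(p * x ** 2 for x in support)      # Y = X^2
E_XY = sum(p * x ** 3 for x in support)     # E(XY) = E(X^3)
cov_xy = E_XY - E_X * E_Y

# Dependence: Y is a function of X, so P(Y = 4 | X = 2) = 1,
# whereas the unconditional probability P(Y = 4) is smaller.
p_y4 = sum(p for x in support if x ** 2 == 4)
```

The covariance is exactly zero although Y is completely determined by X.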


Theorem 3.1 enables us to check for independence between the components of a bivariate
normal random variable. That is, we can use the correlation and test whether it is zero. The
distribution of $r_{XY}$ for an arbitrary (X,Y) is unfortunately complicated. The distribution
of $r_{XY}$ will be more accessible if (X,Y) are jointly normal (see Chapter 5). If we transform
the correlation by Fisher’s Z-transformation,

$$W = \frac{1}{2}\log\left(\frac{1 + r_{XY}}{1 - r_{XY}}\right), \qquad (3.11)$$

we obtain a variable that has a more accessible distribution. Under the hypothesis that
$\rho = 0$, W has an asymptotic normal distribution. Approximations of the expectation and
variance of W are given by the following:

$$E(W) \approx \frac{1}{2}\log\left(\frac{1 + \rho_{XY}}{1 - \rho_{XY}}\right), \qquad \operatorname{Var}(W) \approx \frac{1}{n-3}. \qquad (3.12)$$

The distribution is given in Theorem 3.2.
THEOREM 3.2

$$Z = \frac{W - E(W)}{\sqrt{\operatorname{Var}(W)}} \xrightarrow{\mathcal{L}} N(0,1). \qquad (3.13)$$


The symbol “$\xrightarrow{\mathcal{L}}$” denotes convergence in distribution, which will be explained in more
detail in Chapter 4.


Theorem 3.2 allows us to test different hypotheses on correlation. We can fix the level of
significance α (the probability of rejecting a true hypothesis) and reject the hypothesis if the
difference between the hypothetical value and the calculated value of Z is greater than the 
corresponding critical value of the normal distribution. The following example illustrates 
the procedure. 


EXAMPLE 3.5 Let’s study the correlation between mileage ($X_2$) and weight ($X_8$) for the
car data set (B.3) where n = 74. We have $r_{X_2X_8} = -0.823$. Our conclusions from the
boxplot in Figure 1.3 (“Japanese cars generally have better mileage than the others”) need
to be revised. From Figure 3.3 and $r_{X_2X_8}$, we can see that mileage is highly correlated with
weight, and that the Japanese cars in the sample are in fact all lighter than the others!


If we want to know whether $\rho_{X_2X_8}$ is significantly different from $\rho_0 = 0$, we apply Fisher’s
Z-transform (3.11). This gives us

$$w = \frac{1}{2}\log\left(\frac{1 + r_{X_2X_8}}{1 - r_{X_2X_8}}\right) = -1.166 \quad\text{and}\quad z = \frac{-1.166 - 0}{\sqrt{1/71}} = -9.825,$$

i.e., a highly significant value to reject the hypothesis that $\rho = 0$ (the 2.5% and 97.5%
quantiles of the normal distribution are -1.96 and 1.96, respectively). If we want to test the
hypothesis that, say, $\rho_0 = -0.75$, we obtain:

$$z = \frac{-1.166 - (-0.973)}{\sqrt{1/71}} = -1.627.$$

This is a nonsignificant value at the α = 0.05 level for z since it is between the critical values
at the 5% significance level (i.e., -1.96 < z < 1.96).
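
The two tests of Example 3.5 can be sketched as follows, using only the reported values r = -0.823 and n = 74 together with the approximations in (3.12). The function names `fisher_z` and `z_statistic` are illustrative assumptions.

```python
import math

def fisher_z(r):
    """Fisher's Z-transform (3.11)."""
    return 0.5 * math.log((1 + r) / (1 - r))

def z_statistic(r, n, rho0=0.0):
    """Standardized statistic (3.13) for H0: rho = rho0, with Var(W) ~ 1/(n-3)."""
    return (fisher_z(r) - fisher_z(rho0)) * math.sqrt(n - 3)

z_zero = z_statistic(-0.823, 74)             # H0: rho = 0
z_alt = z_statistic(-0.823, 74, rho0=-0.75)  # H0: rho = -0.75

reject_zero = abs(z_zero) > 1.96
reject_alt = abs(z_alt) > 1.96
print(round(z_zero, 3), round(z_alt, 3), reject_zero, reject_alt)
```

The first hypothesis is rejected at the 5% level, the second is not, in line with the example.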


EXAMPLE 3.6 Let us consider again the pullovers data set from Example 3.2. Consider the
correlation between the presence of the sales assistants ($X_4$) vs. the number of sold pullovers
($X_1$) (see Figure 3.4). Here we compute the correlation as

$$r_{X_1X_4} = 0.633.$$

The Z-transform of this value is

$$w = \frac{1}{2}\log\left(\frac{1 + r_{X_1X_4}}{1 - r_{X_1X_4}}\right) = 0.746. \qquad (3.14)$$

The sample size is n = 10, so for the hypothesis $\rho_{X_1X_4} = 0$, the statistic to consider is:

$$z = \sqrt{7}\,(0.746 - 0) = 1.974 \qquad (3.15)$$

which is just statistically significant at the 5% level (i.e., 1.974 is just a little larger than
1.96).


REMARK 3.2 The normalizing and variance stabilizing properties of W are asymptotic. In
addition the use of W in small samples (for n < 25) is improved by Hotelling’s transform
(Hotelling, 1953):

$$W^* = W - \frac{3W + \tanh(W)}{4(n-1)} \quad\text{with}\quad \operatorname{Var}(W^*) = \frac{1}{n-1}.$$

The transformed variable $W^*$ is asymptotically distributed as a normal distribution.


Figure 3.3. Mileage ($X_2$) vs. weight ($X_8$) of U.S. (star), European (plus
signs) and Japanese (circle) cars. MVAscacar.xpl


EXAMPLE 3.7 From the preceding remark, we obtain $w^* = 0.6663$ and $\sqrt{10-1}\,w^* = 1.9989$
for the preceding Example 3.6. This value is significant at the 5% level.
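
A minimal sketch of Hotelling's transform for these numbers (r = 0.633, n = 10) is given below; the function name `hotelling` is an illustrative assumption. Because the code carries w at full precision instead of the rounded value 0.746, its results agree with the figures of Example 3.7 only up to about the third decimal.

```python
import math

def hotelling(r, n):
    """Hotelling's small-sample correction of Fisher's Z-transform."""
    w = 0.5 * math.log((1 + r) / (1 - r))            # Fisher's W, see (3.11)
    w_star = w - (3 * w + math.tanh(w)) / (4 * (n - 1))
    z_star = math.sqrt(n - 1) * w_star               # uses Var(W*) = 1/(n-1)
    return w_star, z_star

w_star, z_star = hotelling(0.633, 10)
significant = z_star > 1.96
```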


Figure 3.4. Hours of sales assistants ($X_4$) vs. sales ($X_1$) of pullovers. MVAscapull2.xpl


REMARK 3.3 Note that the Fisher’s Z-transform is the inverse of the hyperbolic tangent
function: $W = \tanh^{-1}(r_{XY})$; equivalently $r_{XY} = \tanh(W) = \frac{e^{2W} - 1}{e^{2W} + 1}$.

REMARK 3.4 Under the assumptions of normality of X and Y, we may test their independence
($\rho_{XY} = 0$) using the exact t-distribution of the statistic

$$T = r_{XY}\sqrt{\frac{n-2}{1 - r_{XY}^2}} \;\overset{\rho_{XY}=0}{\sim}\; t_{n-2}.$$

Setting the probability of the first error type to α, we reject the null hypothesis $\rho_{XY} = 0$ if
$|T| \ge t_{1-\alpha/2;\,n-2}$.
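
The exact test of Remark 3.4 can be sketched for the pullover correlation r = 0.633 with n = 10. The Python standard library has no Student's t quantile function, so the critical value 2.306 (the 97.5% quantile of the t-distribution with 8 degrees of freedom) is hard-coded here from a standard table; that constant and the function name are assumptions of this sketch.

```python
import math

def t_statistic(r, n):
    """T = r * sqrt((n - 2) / (1 - r^2)), t-distributed with n-2 df under rho = 0."""
    return r * math.sqrt((n - 2) / (1 - r * r))

T = t_statistic(0.633, 10)
T_CRIT = 2.306          # table value of t_{0.975; 8}
reject = abs(T) >= T_CRIT
```

As with the normal approximation, the value lies only slightly above the critical value.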


Summary 


The correlation is a standardized measure of dependence. 


The absolute value of the correlation never exceeds one.


Correlation measures only linear dependence. 


There are nonlinear dependencies that have zero correlation. 


Zero correlation does not imply independence. 


Independence implies zero correlation. 


Negative correlation corresponds to downward-sloping scatterplots. 


Positive correlation corresponds to upward-sloping scatterplots.


Fisher’s Z-transform helps us in testing hypotheses on correlation.


For small samples, Fisher’s Z-transform can be improved by the transformation
$W^* = W - \frac{3W + \tanh(W)}{4(n-1)}$.


3.3 Summary Statistics


This section focuses on the representation of basic summary statistics (means, covariances 
and correlations) in matrix notation, since we often apply linear transformations to data. 
The matrix notation allows us to derive instantaneously the corresponding characteristics of 
the transformed variables. The Mahalanobis transformation is a prominent example of such 
linear transformations. 


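Assume that we have observed n realizations of a p-dimensional random variable; we have a
data matrix $\mathcal{X}(n \times p)$:

$$\mathcal{X} = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}. \qquad (3.16)$$

The rows $x_i = (x_{i1}, \ldots, x_{ip}) \in \mathbb{R}^p$ denote the i-th observation of a p-dimensional random
variable $X \in \mathbb{R}^p$.


The statistics that were briefly introduced in Section 3.1 and 3.2 can be rewritten in matrix
form as follows. The “center of gravity” of the n observations in $\mathbb{R}^p$ is given by the vector $\bar{x}$
of the means $\bar{x}_j$ of the p variables:

$$\bar{x} = \begin{pmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_p \end{pmatrix} = n^{-1}\mathcal{X}^\top 1_n. \qquad (3.17)$$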


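The dispersion of the n observations can be characterized by the covariance matrix of the
p variables. The empirical covariances defined in (3.2) and (3.3) are the elements of the
following matrix:

$$S = n^{-1}\mathcal{X}^\top\mathcal{X} - \bar{x}\bar{x}^\top = n^{-1}\left(\mathcal{X}^\top\mathcal{X} - n^{-1}\mathcal{X}^\top 1_n 1_n^\top\mathcal{X}\right). \qquad (3.18)$$

Note that this matrix is equivalently defined by

$$S = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^\top.$$

The covariance formula (3.18) can be rewritten as $S = n^{-1}\mathcal{X}^\top\mathcal{H}\mathcal{X}$ with the centering matrix

$$\mathcal{H} = \mathcal{I}_n - n^{-1}1_n1_n^\top. \qquad (3.19)$$

Note that the centering matrix is symmetric and idempotent. Indeed,

$$\mathcal{H}^2 = (\mathcal{I}_n - n^{-1}1_n1_n^\top)(\mathcal{I}_n - n^{-1}1_n1_n^\top)$$
$$= \mathcal{I}_n - n^{-1}1_n1_n^\top - n^{-1}1_n1_n^\top + (n^{-1}1_n1_n^\top)(n^{-1}1_n1_n^\top)$$
$$= \mathcal{I}_n - n^{-1}1_n1_n^\top = \mathcal{H}.$$


As a consequence S is positive semidefinite, i.e.

$$S \ge 0. \qquad (3.20)$$

Indeed for all $a \in \mathbb{R}^p$,

$$a^\top S a = n^{-1}a^\top\mathcal{X}^\top\mathcal{H}\mathcal{X}a = n^{-1}(a^\top\mathcal{X}^\top\mathcal{H}^\top)(\mathcal{H}\mathcal{X}a) \quad\text{since } \mathcal{H}^\top\mathcal{H} = \mathcal{H},$$
$$= n^{-1}y^\top y = n^{-1}\sum_{j=1}^{n}y_j^2 \ge 0$$

for $y = \mathcal{H}\mathcal{X}a$. It is well known from the one-dimensional case that $n^{-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
as an estimate of the variance exhibits a bias of the order $n^{-1}$ (Breiman, 1973). In the
multidimensional case, $S_u = \frac{n}{n-1}S$ is an unbiased estimate of the true covariance. (This will
be shown in Example 4.15.)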
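
The matrix identities of this passage can be checked numerically. The sketch below uses plain Python lists as matrices; the helpers `matmul` and `transpose` are illustrative assumptions, and the data are the sales and price columns of the pullover data set of Example 3.2.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Data matrix X (n x 2): pullover sales and price.
X = [[230, 125], [181, 99], [165, 97], [150, 115], [97, 120],
     [192, 100], [181, 80], [189, 90], [172, 95], [170, 125]]
n = len(X)

# Centering matrix H = I_n - n^{-1} 1_n 1_n^T, see (3.19).
H = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]

HH = matmul(H, H)                                   # should reproduce H
S = [[v / n for v in row]                           # S = n^{-1} X^T H X
     for row in matmul(transpose(X), matmul(H, X))]

a = [1.0, -2.0]                                     # arbitrary direction
quad = sum(a[i] * S[i][j] * a[j] for i in range(2) for j in range(2))
```

The quadratic form is nonnegative for any choice of `a`, which is the content of (3.20).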


The sample correlation coefficient between the i-th and j-th variables is $r_{X_iX_j}$, see (3.8). If
$D = \operatorname{diag}(s_{X_iX_i})$, then the correlation matrix is

$$R = D^{-1/2}SD^{-1/2}, \qquad (3.21)$$

where $D^{-1/2}$ is a diagonal matrix with elements $(s_{X_iX_i})^{-1/2}$ on its main diagonal.


EXAMPLE 3.8 The empirical covariances are calculated for the pullover data set.
The vector of the means of the four variables in the dataset is $\bar{x} = (172.7, 104.6, 104.0, 93.8)^\top$.

The sample covariance matrix is

$$S = \begin{pmatrix}
1037.2 & -80.2 & 1430.7 & 271.4 \\
-80.2 & 219.8 & 92.1 & -91.6 \\
1430.7 & 92.1 & 2624 & 210.3 \\
271.4 & -91.6 & 210.3 & 177.4
\end{pmatrix}.$$

The unbiased estimate of the variance (n = 10) is equal to

$$S_u = \frac{10}{9}S = \begin{pmatrix}
1152.5 & -88.9 & 1589.7 & 301.6 \\
-88.9 & 244.3 & 102.3 & -101.8 \\
1589.7 & 102.3 & 2915.6 & 233.7 \\
301.6 & -101.8 & 233.7 & 197.1
\end{pmatrix}.$$

The sample correlation matrix is

$$R = \begin{pmatrix}
1 & -0.17 & 0.87 & 0.63 \\
-0.17 & 1 & 0.12 & -0.46 \\
0.87 & 0.12 & 1 & 0.31 \\
0.63 & -0.46 & 0.31 & 1
\end{pmatrix}.$$
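
Formula (3.21) can be applied directly to the sample covariance matrix of Example 3.8. The following sketch takes the printed entries of S as input and rescales them with the diagonal; the variable names are illustrative assumptions.

```python
# Sample covariance matrix S as printed in Example 3.8.
S = [[1037.2, -80.2, 1430.7, 271.4],
     [-80.2, 219.8, 92.1, -91.6],
     [1430.7, 92.1, 2624.0, 210.3],
     [271.4, -91.6, 210.3, 177.4]]
p = len(S)

# Diagonal of D^{-1/2}: one over the standard deviation of each variable.
d = [S[i][i] ** -0.5 for i in range(p)]

# R = D^{-1/2} S D^{-1/2}, written out element by element.
R = [[d[i] * S[i][j] * d[j] for j in range(p)] for i in range(p)]
R_rounded = [[round(v, 2) for v in row] for row in R]
for row in R_rounded:
    print(row)
```

Rounded to two decimals, the result agrees with the sample correlation matrix of the example.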


Linear Transformation 


In many practical applications we need to study linear transformations of the original data. 
This motivates the question of how to calculate summary statistics after such linear trans- 
formations. 


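Let A be a (q x p) matrix and consider the transformed data matrix

$$\mathcal{Y} = \mathcal{X}A^\top = (y_1, \ldots, y_n)^\top. \qquad (3.22)$$

The row $y_i = (y_{i1}, \ldots, y_{iq}) \in \mathbb{R}^q$ can be viewed as the i-th observation of a q-dimensional
random variable $Y = AX$. In fact we have $y_i = x_iA^\top$. We immediately obtain the mean
and the empirical covariance of the variables (columns) forming the data matrix $\mathcal{Y}$:

$$\bar{y} = \frac{1}{n}\mathcal{Y}^\top 1_n = \frac{1}{n}A\mathcal{X}^\top 1_n = A\bar{x} \qquad (3.23)$$
$$S_{\mathcal{Y}} = \frac{1}{n}\mathcal{Y}^\top\mathcal{H}\mathcal{Y} = \frac{1}{n}A\mathcal{X}^\top\mathcal{H}\mathcal{X}A^\top = AS_{\mathcal{X}}A^\top. \qquad (3.24)$$

Note that if the linear transformation is nonhomogeneous, i.e.,

$$y_i = Ax_i + b \quad\text{where } b(q \times 1),$$

only (3.23) changes: $\bar{y} = A\bar{x} + b$. The formulas (3.23) and (3.24) are useful in the particular
case of q = 1, i.e., $y = \mathcal{X}a \Leftrightarrow y_i = a^\top x_i$; i = 1,...,n:

$$\bar{y} = a^\top\bar{x}, \qquad S_y = a^\top S_{\mathcal{X}}a.$$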


EXAMPLE 3.9 Suppose that $\mathcal{X}$ is the pullover data set. The manager wants to compute
his mean expenses for advertisement ($X_3$) and sales assistant ($X_4$).


Suppose that the sales assistant charges an hourly wage of 10 EUR. Then the shop manager
calculates the expenses Y as $Y = X_3 + 10X_4$. Formula (3.22) says that this is equivalent to
defining the matrix A(1 x 4) as:

$$A = (0, 0, 1, 10).$$

Using formulas (3.23) and (3.24), it is now computationally very easy to obtain the sample
mean $\bar{y}$ and the sample variance $S_{\mathcal{Y}}$ of the overall expenses:

$$\bar{y} = A\bar{x} = (0, 0, 1, 10)\begin{pmatrix} 172.7 \\ 104.6 \\ 104.0 \\ 93.8 \end{pmatrix} = 1042.0$$

$$S_{\mathcal{Y}} = AS_{\mathcal{X}}A^\top = (0, 0, 1, 10)\begin{pmatrix}
1152.5 & -88.9 & 1589.7 & 301.6 \\
-88.9 & 244.3 & 102.3 & -101.8 \\
1589.7 & 102.3 & 2915.6 & 233.7 \\
301.6 & -101.8 & 233.7 & 197.1
\end{pmatrix}\begin{pmatrix} 0 \\ 0 \\ 1 \\ 10 \end{pmatrix} = 2915.6 + 4674 + 19710 = 27299.6.$$
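
The arithmetic of Example 3.9 can be sketched with the q = 1 versions of (3.23) and (3.24). The inputs are the mean vector and the covariance matrix printed in the example; the variable names are illustrative assumptions.

```python
x_bar = [172.7, 104.6, 104.0, 93.8]
S_x = [[1152.5, -88.9, 1589.7, 301.6],
       [-88.9, 244.3, 102.3, -101.8],
       [1589.7, 102.3, 2915.6, 233.7],
       [301.6, -101.8, 233.7, 197.1]]
a = [0, 0, 1, 10]   # expenses Y = X3 + 10 * X4

# Mean of the transformed variable: a' x_bar.
y_bar = sum(ai * xi for ai, xi in zip(a, x_bar))

# Variance of the transformed variable: a' S a.
s_y = sum(a[i] * S_x[i][j] * a[j] for i in range(4) for j in range(4))
print(round(y_bar, 1), round(s_y, 1))
```

Only the entries for $X_3$ and $X_4$ contribute, because the first two weights in `a` are zero.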


Mahalanobis Transformation 


A special case of this linear transformation is

$$z_i = S^{-1/2}(x_i - \bar{x}), \qquad i = 1, \ldots, n. \qquad (3.25)$$

Note that for the transformed data matrix $\mathcal{Z} = (z_1, \ldots, z_n)^\top$,

$$S_{\mathcal{Z}} = n^{-1}\mathcal{Z}^\top\mathcal{H}\mathcal{Z} = \mathcal{I}_p. \qquad (3.26)$$

So the Mahalanobis transformation eliminates the correlation between the variables and
standardizes the variance of each variable. If we apply (3.24) using $A = S^{-1/2}$, we obtain
the identity covariance matrix as indicated in (3.26).
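
A two-dimensional sketch of the Mahalanobis transformation follows. It is restricted to p = 2 so that the spectral decomposition of S has a closed form and no linear algebra library is needed; that restriction, the use of the sales and price columns of the pullover data, and all variable names are assumptions of the sketch.

```python
import math

X = [[230, 125], [181, 99], [165, 97], [150, 115], [97, 120],
     [192, 100], [181, 80], [189, 90], [172, 95], [170, 125]]
n = len(X)
mean = [sum(col) / n for col in zip(*X)]
C = [[x - m for x, m in zip(row, mean)] for row in X]        # centered rows
S = [[sum(r[i] * r[j] for r in C) / n for j in range(2)] for i in range(2)]

# Closed-form eigenvalues and eigenvectors of the symmetric 2x2 matrix S.
a, b, d = S[0][0], S[0][1], S[1][1]
half_gap = math.hypot((a - d) / 2, b)
lams = [(a + d) / 2 + half_gap, (a + d) / 2 - half_gap]
vecs = []
for lam in lams:
    v = (b, lam - a)                  # eigenvector; valid because b != 0 here
    norm = math.hypot(v[0], v[1])
    vecs.append((v[0] / norm, v[1] / norm))

# S^{-1/2} = sum_k lam_k^{-1/2} gamma_k gamma_k^T
S_inv_sqrt = [[sum(l ** -0.5 * g[i] * g[j] for l, g in zip(lams, vecs))
               for j in range(2)] for i in range(2)]

# z_i = S^{-1/2} (x_i - x_bar), and the covariance of the transformed data.
Z = [[sum(S_inv_sqrt[i][j] * r[j] for j in range(2)) for i in range(2)] for r in C]
S_Z = [[sum(z[i] * z[j] for z in Z) / n for j in range(2)] for i in range(2)]
```

Up to rounding error `S_Z` is the 2 x 2 identity matrix, as stated in (3.26).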


Summary


The center of gravity of a data matrix is given by its mean vector $\bar{x} = n^{-1}\mathcal{X}^\top 1_n$.


The dispersion of the observations in a data matrix is given by the empirical
covariance matrix $S = n^{-1}\mathcal{X}^\top\mathcal{H}\mathcal{X}$.


The empirical correlation matrix is given by $R = D^{-1/2}SD^{-1/2}$.


A linear transformation $\mathcal{Y} = \mathcal{X}A^\top$ of a data matrix $\mathcal{X}$ has mean $A\bar{x}$ and
empirical covariance $AS_{\mathcal{X}}A^\top$.


The Mahalanobis transformation is a linear transformation $z_i = S^{-1/2}(x_i - \bar{x})$
which gives a standardized, uncorrelated data matrix $\mathcal{Z}$.


3.4 Linear Model for Two Variables 


We have looked many times now at downward- and upward-sloping scatterplots. What does
the eye define here as slope? Suppose that we can construct a line corresponding to the
general direction of the cloud. The sign of the slope of this line would correspond to the
upward and downward directions. Call the variable on the vertical axis Y and the one on
the horizontal axis X. A slope line is a linear relationship between X and Y:

$$y_i = \alpha + \beta x_i + \varepsilon_i, \qquad i = 1, \ldots, n. \qquad (3.27)$$

Here, α is the intercept and β is the slope of the line. The errors (or deviations from the
line) are denoted as $\varepsilon_i$ and are assumed to have zero mean and finite variance $\sigma^2$. The task
of finding (α, β) in (3.27) is referred to as a linear adjustment.


In Section 3.6 we shall derive estimators for α and β more formally, as well as accurately
describe what a “good” estimator is. For now, one may try to find a “good” estimator $(\hat\alpha, \hat\beta)$
via graphical techniques. A very common numerical and statistical technique is to use those
$\hat\alpha$ and $\hat\beta$ that minimize:

$$(\hat\alpha, \hat\beta) = \arg\min_{(\alpha,\beta)}\sum_{i=1}^{n}(y_i - \alpha - \beta x_i)^2. \qquad (3.28)$$

The solutions to this task are the estimators:

$$\hat\beta = \frac{s_{XY}}{s_{XX}} \qquad (3.29)$$
$$\hat\alpha = \bar{y} - \hat\beta\bar{x}. \qquad (3.30)$$

The variance of $\hat\beta$ is:

$$\operatorname{Var}(\hat\beta) = \frac{\sigma^2}{n \cdot s_{XX}}. \qquad (3.31)$$

The standard error (SE) of the estimator is the square root of (3.31),

$$SE(\hat\beta) = \{\operatorname{Var}(\hat\beta)\}^{1/2} = \frac{\sigma}{(n \cdot s_{XX})^{1/2}}. \qquad (3.32)$$

We can use this formula to test the hypothesis that β = 0. In an application the variance
$\sigma^2$ has to be estimated by an estimator $\hat\sigma^2$ that will be given below. Under a normality
assumption of the errors, the t-test for the hypothesis β = 0 works as follows.


One computes the statistic

$$t = \frac{\hat\beta}{SE(\hat\beta)} \qquad (3.33)$$

and rejects the hypothesis at a 5% significance level if $|t| \ge t_{0.975;\,n-2}$, where the 97.5%
quantile of the Student’s $t_{n-2}$ distribution is clearly the 95% critical value for the two-sided
test. For n > 30, this can be replaced by 1.96, the 97.5% quantile of the normal distribution.
An estimator $\hat\sigma^2$ of $\sigma^2$ will be given in the following.


Figure 3.5. Regression of sales ($X_1$) on price ($X_2$) of pullovers. MVAregpull.xpl


EXAMPLE 3.10 Let us apply the linear regression model (3.27) to the “classic blue” pullovers. 
The sales manager believes that there is a strong dependence on the number of sales as a 
function of price. He computes the regression line as shown in Figure 3.5. 


How good is this fit? This can be judged via goodness-of-fit measures. Define

$$\hat{y}_i = \hat\alpha + \hat\beta x_i, \qquad (3.34)$$

as the predicted value of y as a function of x. With $\hat{y}$ the textile shop manager in the above
example can predict sales as a function of prices x. The variation in the response variable
is:

$$n\,s_{YY} = \sum_{i=1}^{n}(y_i - \bar{y})^2. \qquad (3.35)$$

The variation explained by the linear regression (3.27) with the predicted values (3.34) is:

$$\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2. \qquad (3.36)$$

The residual sum of squares, the minimum in (3.28), is given by:

$$RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2. \qquad (3.37)$$

An unbiased estimator $\hat\sigma^2$ of $\sigma^2$ is given by RSS/(n - 2).
The following relation holds between (3.35), (3.36) and (3.37):

$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad (3.38)$$

total variation = explained variation + unexplained variation.


The coefficient of determination is $r^2$:

$$r^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{\text{explained variation}}{\text{total variation}}. \qquad (3.39)$$

The coefficient of determination increases with the proportion of explained variation by the
linear relation (3.27). In the extreme cases where $r^2 = 1$, all of the variation is explained by
the linear regression (3.27). The other extreme, $r^2 = 0$, is where the empirical covariance is
$s_{XY} = 0$. The coefficient of determination can be rewritten as

$$r^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}. \qquad (3.40)$$

From (3.39), it can be seen that in the linear regression (3.27), $r^2 = r_{XY}^2$ is the square of
the correlation between X and Y.


Figure 3.6. Regression of sales ($X_1$) on price ($X_2$) of pullovers. The overall
mean is given by the dashed line. MVAregzoom.xpl


EXAMPLE 3.11 For the above pullover example, we estimate

$$\hat\alpha = 210.774 \quad\text{and}\quad \hat\beta = -0.364.$$

The coefficient of determination is

$$r^2 = 0.028.$$

The textile shop manager concludes that sales are not influenced very much by the price (in
a linear way).
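
The estimates of Example 3.11 can be reproduced from the sales and price columns of the pullover data with formulas (3.29) to (3.39). The sketch below is plain Python; the variable names and the hard-coded table value 2.306 for the 97.5% quantile of the t-distribution with 8 degrees of freedom are assumptions.

```python
sales = [230, 181, 165, 150, 97, 192, 181, 189, 172, 170]   # y
price = [125, 99, 97, 115, 120, 100, 80, 90, 95, 125]       # x
n = len(sales)

x_bar = sum(price) / n
y_bar = sum(sales) / n
s_xx = sum((x - x_bar) ** 2 for x in price) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(price, sales)) / n

beta = s_xy / s_xx                       # (3.29)
alpha = y_bar - beta * x_bar             # (3.30)

fitted = [alpha + beta * x for x in price]
total = sum((y - y_bar) ** 2 for y in sales)                 # (3.35)
explained = sum((f - y_bar) ** 2 for f in fitted)            # (3.36)
rss = sum((y - f) ** 2 for y, f in zip(sales, fitted))       # (3.37)
r2 = explained / total                                       # (3.39)

sigma2_hat = rss / (n - 2)
se_beta = (sigma2_hat / (n * s_xx)) ** 0.5                   # (3.32) with estimated sigma
t = beta / se_beta                                           # (3.33)
significant = abs(t) >= 2.306            # table value of t_{0.975; 8}
```

The t-statistic is far below the critical value, which matches the manager's conclusion that the linear influence of price is weak.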


The geometrical representation of formula (3.38) can be graphically evaluated using Fig- 
ure 3.6. This plot shows a section of the linear regression of the “sales” on “price” for the 
pullovers data. The distance between any point and the overall mean is given by the distance 
between the point and the regression line and the distance between the regression line and 
the mean. The sums of these two distances represent the total variance (solid blue lines 
from the observations to the overall mean), i.e., the explained variance (distance from the 
regression curve to the mean) and the unexplained variance (distance from the observation 
to the regression line), respectively. 


In general the regression of Y on X is different from that of X on Y. We will demonstrate 
this using once again the Swiss bank notes data. 


EXAMPLE 3.12 The least squares fit of the variables $X_4$ (X) and $X_5$ (Y) from the genuine
bank notes is calculated. Figure 3.7 shows the fitted line if $X_5$ is approximated by a linear
function of $X_4$. In this case the parameters are

$$\hat\alpha = 15.464 \quad\text{and}\quad \hat\beta = -0.638.$$

If we predict $X_4$ by a function of $X_5$ instead, we would arrive at a different intercept and
slope:

$$\hat\alpha = 14.666 \quad\text{and}\quad \hat\beta = -0.626.$$


Figure 3.7. Regression of $X_5$ (upper inner frame) on $X_4$ (lower inner
frame) for genuine bank notes. MVAregbank.xpl


The linear regression of Y on X is given by minimizing (3.28), i.e., the vertical errors $\varepsilon_i$. The
linear regression of X on Y does the same but here the errors to be minimized in the least 
squares sense are measured horizontally. As seen in Example 3.12, the two least squares lines 
are different although both measure (in a certain sense) the slope of the cloud of points. 
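
Both slopes of Example 3.12 follow from the slope formula (3.29) applied to the covariance matrix of the genuine bank notes given in Example 3.1. The sketch below uses only the three relevant entries of that matrix; the variable names are illustrative assumptions, and the intercepts are not recomputed because the means are not given in this section.

```python
# Entries of the covariance matrix of the genuine bank notes (Example 3.1).
s44, s55, s45 = 0.409, 0.417, -0.261

slope_5_on_4 = s45 / s44      # regression of X5 on X4
slope_4_on_5 = s45 / s55      # regression of X4 on X5

# Drawn in the same (X4, X5) plane, the second line has slope 1 / slope_4_on_5.
slope_second_line_in_plane = 1 / slope_4_on_5
print(round(slope_5_on_4, 3), round(slope_4_on_5, 3))
```

In a common plot the two least squares lines therefore have clearly different slopes, even though both fitted coefficients are close to -0.63.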


As shown in the next example, there is still one other way to measure the main direction of
a cloud of points: it is related to the spectral decomposition of covariance matrices.


Figure 3.8. Scatterplot for a sample of two correlated normal random
variables (sample size n = 150, ρ = 0.8). MVAcorrnorm.xpl


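EXAMPLE 3.13 Suppose that we have the following covariance matrix:

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \qquad \rho > 0.$$

Figure 3.8 shows a scatterplot of a sample of two normal random variables with such a
covariance matrix (with ρ = 0.8).


The eigenvalues of Σ are, as was shown in Example 2.4, solutions to:

$$\begin{vmatrix} 1 - \lambda & \rho \\ \rho & 1 - \lambda \end{vmatrix} = 0.$$

Hence, $\lambda_1 = 1 + \rho$ and $\lambda_2 = 1 - \rho$. Therefore $\Lambda = \operatorname{diag}(1 + \rho, 1 - \rho)$. The eigenvector
corresponding to $\lambda_1 = 1 + \rho$ can be computed from the system of linear equations:

$$\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = (1 + \rho)\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

or

$$x_1 + \rho x_2 = x_1 + \rho x_1$$
$$\rho x_1 + x_2 = x_2 + \rho x_2$$

and thus $x_1 = x_2$.


The first (standardized) eigenvector is

$$\gamma_1 = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}.$$

The direction of this eigenvector is the diagonal in Figure 3.8 and captures the main variation
in this direction. We shall come back to this interpretation in Chapter 9. The second
eigenvector (orthogonal to $\gamma_1$) is

$$\gamma_2 = \begin{pmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix}.$$

So finally

$$\Gamma = (\gamma_1, \gamma_2) = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$$

and we can check our calculation by

$$\Sigma = \Gamma\Lambda\Gamma^\top.$$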


The first eigenvector captures the main direction of a point cloud. The linear regression of
Y on X and X on Y accomplish, in a sense, the same thing. In general the direction of
the eigenvector and the least squares slope are different. The reason is that the least squares
estimator minimizes either vertical or horizontal errors (in (3.28)), whereas the first eigenvector
corresponds to a minimization that is orthogonal to the eigenvector (see Chapter 9).
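
The difference between the three directions can be made concrete for the covariance matrix of Example 3.13 with ρ = 0.8. The sketch below checks the spectral decomposition and compares the slope of the first eigenvector with the theoretical least squares slopes obtained by applying (3.29) to the entries of Σ; the variable names are illustrative assumptions.

```python
import math

rho = 0.8
Sigma = [[1.0, rho], [rho, 1.0]]
lam = [1 + rho, 1 - rho]                   # eigenvalues
g = 1 / math.sqrt(2)
Gamma = [[g, g], [g, -g]]                  # columns are gamma_1 and gamma_2

# Check Sigma = Gamma Lambda Gamma^T element by element.
recon = [[sum(Gamma[i][k] * lam[k] * Gamma[j][k] for k in range(2))
          for j in range(2)] for i in range(2)]

slope_eigen = Gamma[1][0] / Gamma[0][0]    # direction of gamma_1
slope_y_on_x = Sigma[0][1] / Sigma[0][0]   # least squares slope of Y on X
slope_x_on_y = Sigma[1][1] / Sigma[0][1]   # X-on-Y line, drawn in the (x, y) plane
```

For ρ = 0.8 the eigenvector direction (slope 1) lies between the two regression lines (slopes 0.8 and 1.25).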


Summary


The linear regression $y = \alpha + \beta x + \varepsilon$ models a linear relation between two
one-dimensional variables.


The sign of the slope $\hat\beta$ is the same as that of the covariance and the
correlation of x and y.


A linear regression predicts values of Y given a possible observation x of X.


The coefficient of determination $r^2$ measures the amount of variation in
Y which is explained by a linear regression on X.


If the coefficient of determination is $r^2 = 1$, then all points lie on one line.


The regression line of X on Y and the regression line of Y on X are in
general different.


The t-test for the hypothesis β = 0 is $t = \hat\beta / SE(\hat\beta)$, where $SE(\hat\beta) = \hat\sigma / (n \cdot s_{XX})^{1/2}$.


The t-test rejects the null hypothesis β = 0 at the level of significance α
if $|t| \ge t_{1-\alpha/2;\,n-2}$ where $t_{1-\alpha/2;\,n-2}$ is the 1 - α/2 quantile of the Student’s
t-distribution with (n - 2) degrees of freedom.


The standard error $SE(\hat\beta)$ increases/decreases with less/more spread in
the X variables.


The direction of the first eigenvector of the covariance matrix of a two-dimensional
point cloud is different from the least squares regression line.


3.5 Simple Analysis of Variance 


In a simple (i.e., one-factorial) analysis of variance (ANOVA), it is assumed that the average
values of the response variable y are induced by one simple factor. Suppose that this factor
takes on p values and that for each factor level, we have m = n/p observations. The sample
is of the form given in Table 3.5, where all of the observations are independent.


sample element |  factor levels l
               |  1          ...   l          ...   p
1              |  $y_{11}$   ...   $y_{1l}$   ...   $y_{1p}$
2              |  ...              ...              ...
k              |  $y_{k1}$   ...   $y_{kl}$   ...   $y_{kp}$
m = n/p        |  $y_{m1}$   ...   $y_{ml}$   ...   $y_{mp}$


Table 3.5. Observation structure of a simple ANOVA.


The goal of a simple ANOVA is to analyze the observation structure

$$y_{kl} = \mu_l + \varepsilon_{kl} \quad\text{for } k = 1, \ldots, m, \text{ and } l = 1, \ldots, p. \qquad (3.41)$$

Each factor level l has a mean value $\mu_l$. Each observation $y_{kl}$ is assumed to be a sum of the
corresponding factor mean value $\mu_l$ and a zero mean random error $\varepsilon_{kl}$. The linear regression
model falls into this scheme with m = 1, p = n and $\mu_i = \alpha + \beta x_i$, where $x_i$ is the i-th level
value of the factor.


EXAMPLE 3.14 The “classic blue” pullover company analyzes the effect of three marketing
strategies


1 advertisement in local newspaper,
2 presence of sales assistant,
3 luxury presentation in shop windows.


All of these strategies are tried in 10 different shops. The resulting sale observations are
given in Table 3.6.


shop |  marketing strategy (factor l)
 k   |   1    2    3
  1  |       10   18
  2  |  11   15   14
  3  |  10   11   17
  4  |  12   15    9
  5  |   4   15   14
  6  |  11   13   17
  7  |  12    7   16
  8  |  10   15   14
  9  |  11   13   17
 10  |  13   10   15


Table 3.6. Pullover sales as function of marketing strategy.


There are p = 3 factor levels and n = mp = 30 observations in the data. The “classic blue”
pullover company wants to know whether all three marketing strategies have the same mean
effect or whether there are differences. Having the same effect means that all $\mu_l$ in (3.41)
equal one value, $\mu$. The hypothesis to be tested is therefore

$$H_0: \mu_l = \mu \quad \text{for } l = 1, \ldots, p.$$

The alternative hypothesis, that the marketing strategies have different effects, can be
formulated as

$$H_1: \mu_l \neq \mu_{l'} \quad \text{for some } l \text{ and } l'.$$

This means that at least one marketing strategy has a different mean effect from the others.
The method used to test this problem is to compute as in (3.38) the total variation and to
decompose it into the sources of variation. This gives:

$$\sum_{l=1}^{p}\sum_{k=1}^{m}(y_{kl}-\bar y)^2 = m\sum_{l=1}^{p}(\bar y_l-\bar y)^2 + \sum_{l=1}^{p}\sum_{k=1}^{m}(y_{kl}-\bar y_l)^2. \tag{3.42}$$


The total variation (sum of squares = SS) is

$$SS(\text{reduced}) = \sum_{l=1}^{p}\sum_{k=1}^{m}(y_{kl}-\bar y)^2, \tag{3.43}$$

where $\bar y = n^{-1}\sum_{l=1}^{p}\sum_{k=1}^{m} y_{kl}$ is the overall mean. Here the total variation is denoted as
SS(reduced), since in comparison with the model under the alternative $H_1$, we have a reduced
set of parameters. In fact there is 1 parameter $\mu = \mu_l$ under $H_0$. Under $H_1$, the “full” model,
we have three parameters, namely the three different means $\mu_l$.


The variation under $H_1$ is therefore:

$$SS(\text{full}) = \sum_{l=1}^{p}\sum_{k=1}^{m}(y_{kl}-\bar y_l)^2, \tag{3.44}$$

where $\bar y_l = m^{-1}\sum_{k=1}^{m} y_{kl}$ is the mean of each factor level $l$. The hypothetical model $H_0$ is called
reduced, since it has (relative to $H_1$) fewer parameters.


The F-test of the linear hypothesis is used to compare the difference in the variations under
the reduced model $H_0$ (3.43) and the full model $H_1$ (3.44) to the variation under the full
model $H_1$:

$$F = \frac{\{SS(\text{reduced}) - SS(\text{full})\}/\{df(r) - df(f)\}}{SS(\text{full})/df(f)}. \tag{3.45}$$

Here $df(f)$ and $df(r)$ denote the degrees of freedom under the full model and the reduced
model respectively. The degrees of freedom are essential in specifying the shape of the
F-distribution. They have a simple interpretation: $df(\cdot)$ is equal to the number of observations
minus the number of parameters in the model.


From Example 3.14, p = 3 parameters are estimated under the full model, i.e., $df(f) =
n - p = 30 - 3 = 27$. Under the reduced model, there is one parameter to estimate, namely
the overall mean, i.e., $df(r) = n - 1 = 29$. We can compute

$$SS(\text{reduced}) = 260.3$$

and

$$SS(\text{full}) = 157.7.$$

The F-statistic (3.45) is therefore

$$F = \frac{(260.3 - 157.7)/2}{157.7/27} = 8.78.$$
This value needs to be compared to the quantiles of the $F_{2,27}$ distribution. Looking up the
critical values in an F-distribution table shows that the test statistic above is highly significant.
We conclude that the marketing strategies have different effects.
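The computation in Example 3.14 can be retraced with a short script. The following is only a sketch in plain Python (the variable names are illustrative and not part of the book's software); the data are transcribed from Table 3.6, with the first-column entries for shops 1 and 5 taken as 9 and 7, the values consistent with the column sum 106 and with SS(full) = 157.7 quoted in the text.

```python
# Pullover sales from Table 3.6: one row per shop, one column per marketing strategy.
sales = [
    [9, 10, 18],
    [11, 15, 14],
    [10, 11, 17],
    [12, 15, 9],
    [7, 15, 14],
    [11, 13, 17],
    [12, 7, 16],
    [10, 15, 14],
    [11, 13, 17],
    [13, 10, 15],
]

m = len(sales)       # observations per factor level
p = len(sales[0])    # number of factor levels
n = m * p            # total number of observations

columns = [[row[l] for row in sales] for l in range(p)]
grand_mean = sum(sum(col) for col in columns) / n
level_means = [sum(col) / m for col in columns]

# SS(reduced): variation around the overall mean, as in (3.43).
ss_reduced = sum((y - grand_mean) ** 2 for col in columns for y in col)
# SS(full): variation around each factor-level mean, as in (3.44).
ss_full = sum((y - mean) ** 2 for col, mean in zip(columns, level_means) for y in col)

df_reduced = n - 1   # one parameter under H0
df_full = n - p      # p parameters under H1
f_stat = ((ss_reduced - ss_full) / (df_reduced - df_full)) / (ss_full / df_full)

print(f"SS(reduced) = {ss_reduced:.1f}, SS(full) = {ss_full:.1f}, F = {f_stat:.2f}")
```

The script reproduces the sums of squares 260.3 and 157.7 and the F-statistic 8.78; the comparison with the $F_{2,27}$ quantile is left to a table or a statistics library.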


The F-test in a linear regression model 


The t-test of a linear regression model can be put into this framework. For a linear regression
model (3.27), the reduced model is the one with $\beta = 0$:

$$y_i = \alpha + 0 \cdot x_i + \varepsilon_i.$$

The reduced model has $n - 1$ degrees of freedom and one parameter, the intercept $\alpha$.
The full model is given by $\beta \neq 0$,

$$y_i = \alpha + \beta \cdot x_i + \varepsilon_i,$$

and has $n - 2$ degrees of freedom, since there are two parameters $(\alpha, \beta)$.
The SS(reduced) equals

$$SS(\text{reduced}) = \sum_{i=1}^{n}(y_i - \bar y)^2 = \text{total variation}.$$

The SS(full) equals

$$SS(\text{full}) = \sum_{i=1}^{n}(y_i - \hat y_i)^2 = RSS = \text{unexplained variation}.$$

The F-test is therefore, from (3.45),

$$F = \frac{(\text{total variation} - \text{unexplained variation})/1}{(\text{unexplained variation})/(n-2)} \tag{3.46}$$

$$\phantom{F} = \frac{\text{explained variation}}{(\text{unexplained variation})/(n-2)}. \tag{3.47}$$

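Using the estimators $\hat\alpha$ and $\hat\beta$ the explained variation is:

$$\sum_{i=1}^{n}(\hat y_i - \bar y)^2 = \sum_{i=1}^{n}\left(\hat\alpha + \hat\beta x_i - \bar y\right)^2
= \sum_{i=1}^{n}\left\{(\bar y - \hat\beta \bar x) + \hat\beta x_i - \bar y\right\}^2
= \sum_{i=1}^{n}\hat\beta^2 (x_i - \bar x)^2
= \hat\beta^2 n s_{XX}.$$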
From (3.32) the F-ratio (3.46) is therefore:

$$F = \frac{\hat\beta^2 n s_{XX}}{RSS/(n-2)} \tag{3.48}$$

$$\phantom{F} = \left(\frac{\hat\beta}{SE(\hat\beta)}\right)^2. \tag{3.49}$$

The t-test statistic (3.33) is just the square root of the F-statistic (3.49).
Note, using (3.39) the F-statistic can be rewritten as

$$F = \frac{r^2/1}{(1-r^2)/(n-2)}.$$

In the pullover Example 3.11, we obtain $F = 0.2305$, so that the null hypothesis
$\beta = 0$ cannot be rejected. We conclude therefore that there is only a minor influence of
prices on sales.
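The identity between the squared t-statistic and the F-statistic can be checked numerically. The sketch below uses a small hypothetical data set (not the pullover data) and assumes the usual standard error of the slope, $SE(\hat\beta) = \{RSS/(n-2)\}^{1/2}/(n s_{XX})^{1/2}$; it computes F once from the sums of squares as in (3.46) and once from $r^2$.

```python
import math

# Hypothetical illustration data (not the pullover data set).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.0, 1.5, 4.0, 3.5, 6.0, 4.5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
n_sxx = sum((xi - x_bar) ** 2 for xi in x)                        # n * s_XX
n_sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # n * s_XY
total_variation = sum((yi - y_bar) ** 2 for yi in y)              # SS(reduced)

# Least squares estimates of slope and intercept.
beta_hat = n_sxy / n_sxx
alpha_hat = y_bar - beta_hat * x_bar
rss = sum((yi - alpha_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))  # SS(full)

# t-statistic for the slope.
se_beta = math.sqrt(rss / (n - 2)) / math.sqrt(n_sxx)
t_stat = beta_hat / se_beta

# F-statistic from the sums of squares, and again from r^2.
f_from_ss = (total_variation - rss) / (rss / (n - 2))
r2 = n_sxy ** 2 / (n_sxx * total_variation)
f_from_r2 = r2 / ((1.0 - r2) / (n - 2))

print(f"t^2 = {t_stat ** 2:.6f}, F = {f_from_ss:.6f}, F from r^2 = {f_from_r2:.6f}")
```

All three printed numbers agree up to rounding error, which is the content of (3.49) and of the $r^2$ form of the F-statistic.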


Summary

↪ Simple ANOVA models an output Y as a function of one factor.

↪ The reduced model is the hypothesis of equal means.

↪ The full model is the alternative hypothesis of different means.

↪ The F-test is based on a comparison of the sum of squares under the full
and the reduced models.

↪ The degrees of freedom are calculated as the number of observations minus
the number of parameters.

↪ The F-statistic is

$$F = \frac{\{SS(\text{reduced}) - SS(\text{full})\}/\{df(r) - df(f)\}}{SS(\text{full})/df(f)}.$$

↪ The F-test rejects the null hypothesis if the F-statistic is larger than the
95% quantile of the $F_{df(r)-df(f),\,df(f)}$ distribution.

↪ The F-test statistic for the slope of the linear regression model $y_i = \alpha +
\beta x_i + \varepsilon_i$ is the square of the t-test statistic.
3.6 Multiple Linear Model


The simple linear model and the analysis of variance model can be viewed as particular
cases of a more general linear model where the variations of one variable y are explained by p
explanatory variables x. Let $y\,(n \times 1)$ and $\mathcal{X}\,(n \times p)$ be a vector of observations
on the response variable and a data matrix on the p explanatory variables. An important
application of the developed theory is the least squares fitting. The idea is to approximate
$y$ by a linear combination $\hat y$ of columns of $\mathcal{X}$, i.e., $\hat y \in C(\mathcal{X})$. The problem is to find $\hat\beta \in \mathbb{R}^p$
such that $\hat y = \mathcal{X}\hat\beta$ is the best fit of $y$ in the least-squares sense. The linear model can be
written as

$$y = \mathcal{X}\beta + \varepsilon, \tag{3.50}$$

where $\varepsilon$ are the errors. The least squares solution is given by $\hat\beta$:

$$\hat\beta = \arg\min_{\beta}\,(y - \mathcal{X}\beta)^{\top}(y - \mathcal{X}\beta) = \arg\min_{\beta}\,\varepsilon^{\top}\varepsilon. \tag{3.51}$$

Suppose that $(\mathcal{X}^{\top}\mathcal{X})$ is of full rank and thus invertible. Minimizing the expression (3.51)
with respect to $\beta$ yields:

$$\hat\beta = (\mathcal{X}^{\top}\mathcal{X})^{-1}\mathcal{X}^{\top}y. \tag{3.52}$$

The fitted value $\hat y = \mathcal{X}\hat\beta = \mathcal{X}(\mathcal{X}^{\top}\mathcal{X})^{-1}\mathcal{X}^{\top}y = \mathcal{P}y$ is the projection of $y$ onto $C(\mathcal{X})$ as
computed in (2.47).


The least squares residuals are

$$e = y - \hat y = y - \mathcal{X}\hat\beta = \mathcal{Q}y = (\mathcal{I}_n - \mathcal{P})y.$$

The vector $e$ is the projection of $y$ onto the orthogonal complement of $C(\mathcal{X})$.


REMARK 3.5 A linear model with an intercept $\alpha$ can also be written in this framework.
The approximating equation is:

$$y_i = \alpha + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i; \quad i = 1, \ldots, n.$$

This can be written as:

$$y = \mathcal{X}^{*}\beta^{*} + \varepsilon$$

where $\mathcal{X}^{*} = (1_n \; \mathcal{X})$ (we add a column of ones to the data). We have by (3.52):

$$\hat\beta^{*} = \begin{pmatrix} \hat\alpha \\ \hat\beta \end{pmatrix} = (\mathcal{X}^{*\top}\mathcal{X}^{*})^{-1}\mathcal{X}^{*\top}y.$$
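The estimator (3.52) and the orthogonality of the residuals to $C(\mathcal{X})$ can be illustrated with a few lines of code. The sketch below is a minimal plain-Python version under stated assumptions: the helper functions and the data are hypothetical, and the normal equations $(\mathcal{X}^{\top}\mathcal{X})\beta = \mathcal{X}^{\top}y$ are solved by Gaussian elimination instead of forming the inverse explicitly. A column of ones is added as in Remark 3.5.

```python
def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve the square system a z = b by Gaussian elimination with partial pivoting."""
    k = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * k
    for i in reversed(range(k)):
        z[i] = (aug[i][k] - sum(aug[i][j] * z[j] for j in range(i + 1, k))) / aug[i][i]
    return z


def least_squares(x_rows, y):
    """Return beta_hat solving the normal equations (X'X) beta = X'y, cf. (3.52)."""
    xt = transpose(x_rows)
    xtx = matmul(xt, x_rows)
    xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in xt]
    return solve(xtx, xty)


# Hypothetical data: six observations on two explanatory variables.
x_data = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 11.9]

x_star = [[1.0] + row for row in x_data]          # X* = (1_n, X), Remark 3.5
beta_star = least_squares(x_star, y)              # (alpha_hat, beta_1_hat, beta_2_hat)
fitted = [sum(b * v for b, v in zip(beta_star, row)) for row in x_star]   # P y
residuals = [yi - fi for yi, fi in zip(y, fitted)]                        # Q y

# The residual vector is orthogonal to every column of X*.
xt_e = [sum(c * e for c, e in zip(col, residuals)) for col in transpose(x_star)]
print(beta_star, xt_e)
```

The entries of `xt_e` vanish up to rounding error, which is the numerical counterpart of $e$ being the projection of $y$ onto the orthogonal complement of the column space.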


EXAMPLE 3.15 Let us come back to the “classic blue” pullovers example. In Example 3.11,
we considered the regression fit of the sales $X_1$ on the price $X_2$ and concluded that there was
only a small influence of prices on sales. A linear model incorporating all three
variables allows us to approximate sales as a linear function of price ($X_2$), advertisement
($X_3$) and presence of sales assistants ($X_4$) simultaneously. Adding a column of ones to the
data (in order to estimate the intercept $\alpha$) leads to

$$\hat\alpha = 65.670 \quad \text{and} \quad \hat\beta_1 = -0.216, \; \hat\beta_2 = 0.485, \; \hat\beta_3 = 0.844.$$

The coefficient of determination is computed as before in (3.40) and is:

$$r^2 = 1 - \frac{e^{\top}e}{\sum (y_i - \bar y)^2} = 0.907.$$

We conclude that the variation of $X_1$ is well approximated by the linear relation.


REMARK 3.6 The coefficient of determination is influenced by the number of regressors.
For a given sample size n, the $r^2$ value will increase by adding more regressors into the
linear model. The value of $r^2$ may therefore be high even if possibly irrelevant regressors are
included. A corrected coefficient of determination for p regressors and a constant intercept
(p + 1 parameters) is

$$r^2_{adj} = r^2 - \frac{p(1 - r^2)}{n - (p+1)}. \tag{3.53}$$
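Formula (3.53) is easy to evaluate directly. The sketch below is a minimal illustration (the function name is an assumption of this example); it uses the values $r^2 = 0.765$, $n = 506$ and $p = 13$ reported for the Boston Housing regression in this chapter, and it also checks that (3.53) coincides with the commonly used form $1 - (1 - r^2)(n-1)/(n-p-1)$.

```python
def adjusted_r2(r2, n, p):
    """Corrected coefficient of determination (3.53) for p regressors plus an intercept."""
    return r2 - p * (1.0 - r2) / (n - (p + 1))


def adjusted_r2_alt(r2, n, p):
    """Algebraically equivalent form often found in textbooks and software."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Values reported for the Boston Housing regression (13 regressors, 506 observations).
boston_adj = adjusted_r2(0.765, 506, 13)
print(f"adjusted r^2 = {boston_adj:.3f}")

# For a fixed r^2 and sample size, the correction grows with the number of regressors.
for p in (1, 3, 5, 10):
    print(p, round(adjusted_r2(0.80, 20, p), 3))
```

With the Boston Housing values the function returns 0.759 after rounding, in agreement with the adjusted value quoted together with $r^2 = 0.765$; the loop uses hypothetical numbers to show how the penalty increases with p.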


EXAMPLE 3.16 The corrected coefficient of determination for Example 3.15 is

$$r^2_{adj} = 0.907 - \frac{3(1 - 0.907)}{10 - 3 - 1} = 0.861.$$

This means that, after correcting for the number of regressors, 86.1% of the variation of the
response variable is explained by the explanatory variables.


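Note that the linear model (3.50) is very flexible and can model nonlinear relationships
between the response y and the explanatory variables x. For example, a quadratic relation
in one variable x could be included. Then $y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$ could be written in
matrix notation as in (3.50), $y = \mathcal{X}\beta + \varepsilon$, where

$$\mathcal{X} = \begin{pmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix}.$$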
Properties of $\hat\beta$


When $y_i$ is the i-th observation of a random variable Y, the errors are also random. Under
standard assumptions (independence, zero mean and constant variance $\sigma^2$), inference can be
conducted on $\beta$. Using the properties of Chapter 4, it is easy to prove:

$$E(\hat\beta) = \beta,$$

$$\mathrm{Var}(\hat\beta) = \sigma^2 (\mathcal{X}^{\top}\mathcal{X})^{-1}.$$

The analogue of the t-test for the multivariate linear regression situation is

$$t = \frac{\hat\beta_j}{SE(\hat\beta_j)}.$$

The standard error of each coefficient $\hat\beta_j$ is given by the square root of the diagonal elements
of the matrix $\mathrm{Var}(\hat\beta)$. In standard situations, the variance $\sigma^2$ of the error $\varepsilon$ is not known.
One may estimate it by

$$\hat\sigma^2 = \frac{1}{n - (p+1)}\,(y - \hat y)^{\top}(y - \hat y),$$

where $(p + 1)$ is the dimension of $\beta$. In testing $\beta_j = 0$ we reject the hypothesis at the
significance level $\alpha$ if $|t| > t_{1-\alpha/2;\,n-(p+1)}$. More general issues on testing linear models are
addressed in Chapter 7.


The ANOVA Model in Matrix Notation 


The simple ANOVA problem (Section 3.5) may also be rewritten in matrix terms. Recall the
definition of a vector of ones from (2.1) and define a vector of zeros as $0_n$. Then construct
the following $(n \times p)$ matrix (here p = 3),

$$\mathcal{X} = \begin{pmatrix} 1_m & 0_m & 0_m \\ 0_m & 1_m & 0_m \\ 0_m & 0_m & 1_m \end{pmatrix}, \tag{3.54}$$

where m = 10. Equation (3.41) then reads as follows.


The parameter vector is $\beta = (\mu_1, \mu_2, \mu_3)^{\top}$. The data set from Example 3.14 can therefore
be written as a linear model $y = \mathcal{X}\beta + \varepsilon$ where $y \in \mathbb{R}^n$ with $n = m \cdot p$ is the stacked vector
of the columns of Table 3.6. The projection into the column space $C(\mathcal{X})$ of (3.54) yields the
least-squares estimator $\hat\beta = (\mathcal{X}^{\top}\mathcal{X})^{-1}\mathcal{X}^{\top}y$. Note that $(\mathcal{X}^{\top}\mathcal{X})^{-1} = (1/10)\,\mathcal{I}_3$ and that $\mathcal{X}^{\top}y =
(106, 124, 151)^{\top}$ is the sum $\sum_{k=1}^{m} y_{kj}$ for each factor level, i.e., the 3 column sums of Table 3.6.
The least squares estimator is therefore the vector $\hat\beta_{H_1} = (\hat\mu_1, \hat\mu_2, \hat\mu_3)^{\top} = (10.6, 12.4, 15.1)^{\top}$
of sample means for each factor level $j = 1, 2, 3$. Under the null hypothesis of equal mean
values $\mu_1 = \mu_2 = \mu_3 = \mu$, we estimate the parameters under the same constraints. This can
be put into the form of a linear constraint:

$$-\mu_1 + \mu_2 = 0,$$
$$-\mu_1 + \mu_3 = 0.$$

This can be written as $\mathcal{A}\beta = a$, where

$$a = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

and

$$\mathcal{A} = \begin{pmatrix} -1 & 1 & 0 \\ -1 & 0 & 1 \end{pmatrix}.$$

The constrained least-squares solution can be shown (Exercise 3.24) to be given by:

$$\hat\beta_{H_0} = \hat\beta_{H_1} - (\mathcal{X}^{\top}\mathcal{X})^{-1}\mathcal{A}^{\top}\{\mathcal{A}(\mathcal{X}^{\top}\mathcal{X})^{-1}\mathcal{A}^{\top}\}^{-1}(\mathcal{A}\hat\beta_{H_1} - a). \tag{3.55}$$

It turns out that (3.55) amounts to simply calculating the overall mean $\bar y = 12.7$ of the
response variable y: $\hat\beta_{H_0} = (12.7, 12.7, 12.7)^{\top}$.


The F-test that has already been applied in Example 3.14 can be written as

$$F = \frac{\{\|y - \mathcal{X}\hat\beta_{H_0}\|^2 - \|y - \mathcal{X}\hat\beta_{H_1}\|^2\}/2}{\|y - \mathcal{X}\hat\beta_{H_1}\|^2/27}, \tag{3.56}$$

which gives the same significant value 8.78. Note that again we compare the $RSS_{H_0}$ of the
reduced model to the $RSS_{H_1}$ of the full model. It corresponds to comparing the lengths of
projections into different column spaces. This general approach in testing linear models is
described in detail in Chapter 7.
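The matrix formulation can be traced numerically for the pullover data. The following plain-Python sketch is illustrative only: the variable names are assumptions of this example, the first-column entries for shops 1 and 5 are taken as 9 and 7 (the values consistent with the column sum 106), and the code exploits the fact that $\mathcal{X}^{\top}\mathcal{X}$ is diagonal for this design, so only a 2 x 2 matrix has to be inverted in (3.55).

```python
# Pullover sales, one list per marketing strategy (column of the data table).
columns = [
    [9, 11, 10, 12, 7, 11, 12, 10, 11, 13],
    [10, 15, 11, 15, 15, 13, 7, 15, 13, 10],
    [18, 14, 17, 9, 14, 17, 16, 14, 17, 15],
]
m, p = len(columns[0]), len(columns)
n = m * p

y = [float(v) for col in columns for v in col]                             # stacked columns
X = [[1.0 if i // m == l else 0.0 for l in range(p)] for i in range(n)]    # design (3.54)

# X'X is diagonal here (equal to m * I_p), so its inverse is diagonal as well.
xtx_diag = [sum(X[i][l] ** 2 for i in range(n)) for l in range(p)]
xty = [sum(X[i][l] * y[i] for i in range(n)) for l in range(p)]
xtx_inv_diag = [1.0 / d for d in xtx_diag]

beta_h1 = [xtx_inv_diag[l] * xty[l] for l in range(p)]    # unconstrained estimator

# Linear constraint A beta = a encoding mu_1 = mu_2 = mu_3.
A = [[-1.0, 1.0, 0.0], [-1.0, 0.0, 1.0]]
a = [0.0, 0.0]
q = len(A)

# G = A (X'X)^{-1} A' is a 2 x 2 matrix, inverted in closed form.
G = [[sum(A[r][l] * xtx_inv_diag[l] * A[s][l] for l in range(p)) for s in range(q)]
     for r in range(q)]
det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
G_inv = [[G[1][1] / det, -G[0][1] / det], [-G[1][0] / det, G[0][0] / det]]

gap = [sum(A[r][l] * beta_h1[l] for l in range(p)) - a[r] for r in range(q)]
lam = [sum(G_inv[r][s] * gap[s] for s in range(q)) for r in range(q)]
beta_h0 = [beta_h1[l] - xtx_inv_diag[l] * sum(A[r][l] * lam[r] for r in range(q))
           for l in range(p)]                              # constrained estimator (3.55)


def rss(beta):
    """Squared length of the residual vector y - X beta."""
    return sum((y[i] - sum(X[i][l] * beta[l] for l in range(p))) ** 2 for i in range(n))


f_stat = ((rss(beta_h0) - rss(beta_h1)) / q) / (rss(beta_h1) / (n - p))
print([round(b, 1) for b in beta_h1], [round(b, 1) for b in beta_h0], round(f_stat, 2))
```

The unconstrained estimate is the vector of level means (10.6, 12.4, 15.1), the constrained estimate collapses to the overall mean 12.7 in every component, and the F-statistic again equals 8.78.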


Summary

↪ The relation $y = \mathcal{X}\beta + e$ models a linear relation between a one-dimensional
variable Y and a p-dimensional variable X. $\mathcal{P}y$ gives the
best linear regression fit of the vector y onto $C(\mathcal{X})$. The least squares
parameter estimator is $\hat\beta = (\mathcal{X}^{\top}\mathcal{X})^{-1}\mathcal{X}^{\top}y$.

↪ The simple ANOVA model can be written as a linear model.

↪ The ANOVA model can be tested by comparing the length of the projection
vectors.

↪ The test statistic of the F-test can be written as

$$\frac{\{\|y - \mathcal{X}\hat\beta_{H_0}\|^2 - \|y - \mathcal{X}\hat\beta_{H_1}\|^2\}/\{df(r) - df(f)\}}{\|y - \mathcal{X}\hat\beta_{H_1}\|^2/df(f)}.$$

↪ The adjusted coefficient of determination is

$$r^2_{adj} = r^2 - \frac{p(1 - r^2)}{n - (p+1)}.$$

3.7 Boston Housing


Variable     mean   median(X)     Var(X)   std(X)
X1           3.61        0.26      73.99     8.60
X2          11.36        0.00     543.94    23.32
X3          11.14        9.69      47.06     6.86
X4           0.07        0.00       0.06     0.25
X5           0.55        0.54       0.01     0.12
X6           6.28        6.21       0.49     0.70
X7          68.57       77.50     792.36    28.15
X8           3.79        3.21       4.43     2.11
X9           9.55        5.00      75.82     8.71
X10        408.24      330.00   28405.00   168.54
X11         18.46       19.05       4.69     2.16
X12        356.67      391.44    8334.80    91.29
X13         12.65       11.36      50.99     7.14
X14         22.53       21.20      84.59     9.20


Table 3.9. Descriptive statistics for the Boston Housing data set.
MVAdescbh.xpl


The main statistics presented so far can be computed for the data matrix $\mathcal{X}\,(506 \times 14)$ from
our Boston Housing data set. The sample means and the sample medians of each variable
are displayed in Table 3.9. The table also provides the unbiased estimates of the variance
of each variable and the corresponding standard deviations. The comparison of the means
and the medians confirms the asymmetry of the components of $\mathcal{X}$ that was pointed out in
Section 1.8.
The (unbiased) sample covariance matrix is given by the following $(14 \times 14)$ matrix $\mathcal{S}_n$:


   73.99    -40.22    23.99   -0.12   0.42   -1.33    85.41    -6.88    46.85    844.82    5.40   -302.38    27.99   -30.72
  -40.22    543.94   -85.41   -0.25  -1.40    5.11  -373.90    32.63   -63.35  -1236.45  -19.78    373.72   -68.78    77.32
   23.99    -85.41    47.06    0.11   0.61   -1.89   124.51   -10.23    35.55    833.36    5.69   -223.58    29.58   -30.52
   -0.12     -0.25     0.11    0.06   0.00    0.02     0.62    -0.05    -0.02     -1.52   -0.07      1.13    -0.10     0.41
    0.42     -1.40     0.61    0.00   0.01   -0.02     2.39    -0.19     0.62     13.05    0.05     -4.02     0.49    -0.46
   -1.33      5.11    -1.89    0.02  -0.02    0.49    -4.75     0.30    -1.28    -34.58   -0.54      8.22    -3.08     4.49
   85.41   -373.90   124.51    0.62   2.39   -4.75   792.36   -44.33   111.77   2402.69   15.94   -702.94   121.08   -97.59
   -6.88     32.63   -10.23   -0.05  -0.19    0.30   -44.33     4.43    -9.07   -189.66   -1.06     56.04    -7.47     4.84
   46.85    -63.35    35.55   -0.02   0.62   -1.28   111.77    -9.07    75.82   1335.76    8.76   -353.28    30.39   -30.56
  844.82  -1236.45   833.36   -1.52  13.05  -34.58  2402.69  -189.66  1335.76  28404.76  168.15  -6797.91   654.71  -726.26
    5.40    -19.78     5.69   -0.07   0.05   -0.54    15.94    -1.06     8.76    168.15    4.69    -35.06     5.78   -10.11
 -302.38    373.72  -223.58    1.13  -4.02    8.22  -702.94    56.04  -353.28  -6797.91  -35.06   8334.75  -238.67   279.99
   27.99    -68.78    29.58   -0.10   0.49   -3.08   121.08    -7.47    30.39    654.71    5.78   -238.67    50.99   -48.45
  -30.72     77.32   -30.52    0.41  -0.46    4.49   -97.59     4.84   -30.56   -726.26  -10.11    279.99   -48.45    84.59


and the corresponding correlation matrix R(14 x 14) is: 


  1.00  -0.20   0.41  -0.06   0.42  -0.22   0.35  -0.38   0.63   0.58   0.29  -0.39   0.46  -0.39
 -0.20   1.00  -0.53  -0.04  -0.52   0.31  -0.57   0.66  -0.31  -0.31  -0.39   0.18  -0.41   0.36
  0.41  -0.53   1.00   0.06   0.76  -0.39   0.64  -0.71   0.60   0.72   0.38  -0.36   0.60  -0.48
 -0.06  -0.04   0.06   1.00   0.09   0.09   0.09  -0.10  -0.01  -0.04  -0.12   0.05  -0.05   0.18
  0.42  -0.52   0.76   0.09   1.00  -0.30   0.73  -0.77   0.61   0.67   0.19  -0.38   0.59  -0.43
 -0.22   0.31  -0.39   0.09  -0.30   1.00  -0.24   0.21  -0.21  -0.29  -0.36   0.13  -0.61   0.70
  0.35  -0.57   0.64   0.09   0.73  -0.24   1.00  -0.75   0.46   0.51   0.26  -0.27   0.60  -0.38
 -0.38   0.66  -0.71  -0.10  -0.77   0.21  -0.75   1.00  -0.49  -0.53  -0.23   0.29  -0.50   0.25
  0.63  -0.31   0.60  -0.01   0.61  -0.21   0.46  -0.49   1.00   0.91   0.46  -0.44   0.49  -0.38
  0.58  -0.31   0.72  -0.04   0.67  -0.29   0.51  -0.53   0.91   1.00   0.46  -0.44   0.54  -0.47
  0.29  -0.39   0.38  -0.12   0.19  -0.36   0.26  -0.23   0.46   0.46   1.00  -0.18   0.37  -0.51
 -0.39   0.18  -0.36   0.05  -0.38   0.13  -0.27   0.29  -0.44  -0.44  -0.18   1.00  -0.37   0.33
  0.46  -0.41   0.60  -0.05   0.59  -0.61   0.60  -0.50   0.49   0.54   0.37  -0.37   1.00  -0.74
 -0.39   0.36  -0.48   0.18  -0.43   0.70  -0.38   0.25  -0.38  -0.47  -0.51   0.33  -0.74   1.00


Analyzing $\mathcal{R}$ confirms most of the comments made from examining the scatterplot matrix
in Chapter 1. In particular, the correlation between $X_{14}$ (the value of the house) and all
the other variables is given by the last row (or column) of $\mathcal{R}$. The highest correlations (in
absolute values) are in decreasing order $X_{13}$, $X_6$, $X_{11}$, $X_3$, $X_{10}$, etc.


Using the Fisher's Z-transform on each of the correlations between $X_{14}$ and the other
variables would confirm that all are significantly different from zero, except the correlation
between $X_{14}$ and $X_4$ (the indicator variable for the Charles River). We know, however, that
the correlation and Fisher's Z-transform are not appropriate for binary variables.


The same descriptive statistics can be calculated for the transformed variables (transformations
were motivated in Section 1.8). The results are given in Table 3.10 and as can be seen
most of the variables are now more symmetric. Note that the covariances and the correlations
are sensitive to these nonlinear transformations. For example, the correlation matrix
is now given below Table 3.10.


Variable                  mean   median(X)   Var(X)   std(X)
$\widetilde{X}_1$            …       -1.36     4.67     2.16
$\widetilde{X}_2$         1.14        0.00     5.44     2.33
$\widetilde{X}_3$            …        2.27     0.60     0.78
$\widetilde{X}_4$         0.07        0.00     0.06     0.25
$\widetilde{X}_5$        -0.61       -0.62     0.04     0.20
$\widetilde{X}_6$         1.83        1.83     0.01     0.11
$\widetilde{X}_7$         5.06        5.29    12.72     3.57
$\widetilde{X}_8$            …        1.17     0.29     0.54
$\widetilde{X}_9$         1.87        1.61        …        …
$\widetilde{X}_{10}$      5.93        5.80     0.16     0.40
$\widetilde{X}_{11}$      2.15        2.04     1.86     1.36
$\widetilde{X}_{12}$      3.57        3.91     0.83     0.91
$\widetilde{X}_{13}$         …           …     0.97     0.99
$\widetilde{X}_{14}$      3.03        3.05     0.17     0.41


Table 3.10. Descriptive statistics for the Boston Housing data set after
the transformation. MVAdescbh.xpl


  1.00  -0.52   0.74   0.03   0.81  -0.32   0.70  -0.74   0.84   0.81   0.45  -0.48   0.62  -0.57
 -0.52   1.00  -0.66  -0.04  -0.57   0.31  -0.53   0.59  -0.35  -0.31  -0.35   0.18  -0.45   0.36
  0.74  -0.66   1.00   0.08   0.75  -0.43   0.66  -0.73   0.58   0.66   0.46  -0.33   0.62  -0.55
  0.03  -0.04   0.08   1.00   0.08   0.08   0.07  -0.09   0.01  -0.04  -0.13   0.05  -0.06   0.16
  0.81  -0.57   0.75   0.08   1.00  -0.32   0.78  -0.86   0.61   0.67   0.34  -0.38   0.61  -0.52
 -0.32   0.31  -0.43   0.08  -0.32   1.00  -0.28   0.28  -0.21  -0.31  -0.32   0.13  -0.64   0.61
  0.70  -0.53   0.66   0.07   0.78  -0.28   1.00  -0.80   0.47   0.54   0.38  -0.29   0.64  -0.48
 -0.74   0.59  -0.73  -0.09  -0.86   0.28  -0.80   1.00  -0.54  -0.60  -0.32   0.32  -0.56   0.41
  0.84  -0.35   0.58   0.01   0.61  -0.21   0.47  -0.54   1.00   0.82   0.40  -0.41   0.46  -0.43
  0.81  -0.31   0.66  -0.04   0.67  -0.31   0.54  -0.60   0.82   1.00   0.48  -0.43   0.53  -0.56
  0.45  -0.35   0.46  -0.13   0.34  -0.32   0.38  -0.32   0.40   0.48   1.00  -0.20   0.43  -0.51
 -0.48   0.18  -0.33   0.05  -0.38   0.13  -0.29   0.32  -0.41  -0.43  -0.20   1.00  -0.36   0.40
  0.62  -0.45   0.62  -0.06   0.61  -0.64   0.64  -0.56   0.46   0.53   0.43  -0.36   1.00  -0.83
 -0.57   0.36  -0.55   0.16  -0.52   0.61  -0.48   0.41  -0.43  -0.56  -0.51   0.40  -0.83   1.00


Notice that some of the correlations between $\widetilde{X}_{14}$ and the other variables have increased.


If we want to explain the variations of the price $\widetilde{X}_{14}$ by the variation of all the other variables
$\widetilde{X}_1, \ldots, \widetilde{X}_{13}$ we could estimate the linear model

$$\widetilde{X}_{14} = \beta_0 + \sum_{j=1}^{13} \beta_j \widetilde{X}_j + \varepsilon. \tag{3.57}$$

The result is given in Table 3.11.


Variable                $\hat\beta_j$   $SE(\hat\beta_j)$         t   p-value
constant                   4.1769            0.3790    11.020    0.0000
$\widetilde{X}_1$         -0.0146            0.0117    -1.254    0.2105
$\widetilde{X}_2$          0.0014            0.0056     0.247    0.8051
$\widetilde{X}_3$         -0.0127            0.0223    -0.570    0.5692
$\widetilde{X}_4$          0.1100            0.0366     3.002    0.0028
$\widetilde{X}_5$         -0.2831            0.1053    -2.688    0.0074
$\widetilde{X}_6$          0.4211            0.1102     3.822    0.0001
$\widetilde{X}_7$          0.0064            0.0049     1.317    0.1885
$\widetilde{X}_8$         -0.1832            0.0368    -4.977    0.0000
$\widetilde{X}_9$          0.0684            0.0225     3.042    0.0025
$\widetilde{X}_{10}$      -0.2018            0.0484    -4.167    0.0000
$\widetilde{X}_{11}$      -0.0400            0.0081    -4.946    0.0000
$\widetilde{X}_{12}$       0.0445            0.0115     3.882    0.0001
$\widetilde{X}_{13}$      -0.2626            0.0161   -16.320    0.0000


Table 3.11. Linear regression results for all variables of Boston Housing
data set. MVAlinregbh.xpl


The values of $r^2$ (0.765) and $r^2_{adj}$ (0.759) show that most of the variance of $\widetilde{X}_{14}$ is explained
by the linear model (3.57).


Again we see that the variations of $\widetilde{X}_{14}$ are mostly explained by (in decreasing order of
the absolute value of the t-statistic) $\widetilde{X}_{13}$, $\widetilde{X}_8$, $\widetilde{X}_{11}$, $\widetilde{X}_{10}$, $\widetilde{X}_{12}$, $\widetilde{X}_6$, $\widetilde{X}_9$, $\widetilde{X}_4$ and $\widetilde{X}_5$. The other
variables $\widetilde{X}_1$, $\widetilde{X}_2$, $\widetilde{X}_3$ and $\widetilde{X}_7$ seem to have little influence on the variations of $\widetilde{X}_{14}$. This will
be confirmed by the testing procedures that will be developed in Chapter 7.


3.8 Exercises 


EXERCISE 3.1 The covariance $s_{X_4X_5}$ between $X_4$ and $X_5$ for the entire bank data set is
positive. Given the definitions of $X_4$ and $X_5$, we would expect a negative covariance. Using
Figure 3.1 can you explain why $s_{X_4X_5}$ is positive?


EXERCISE 3.2 Consider the two sub-clouds of counterfeit and genuine bank notes in
Figure 3.1 separately. Do you still expect $s_{X_4X_5}$ (now calculated separately for each cloud) to be
positive?


EXERCISE 3.3 We remarked that for two normal random variables, zero covariance implies 
independence. Why does this remark not apply to Example 3.4? 


EXERCISE 3.4 Compute the covariance between the variables

$X_2$ = miles per gallon,
$X_8$ = weight

from the car data set (Table B.3). What sign do you expect the covariance to have?


EXERCISE 3.5 Compute the correlation matrix of the variables in Example 3.2. Comment
on the sign of the correlations and test the hypothesis

$$\rho_{X_1X_2} = 0.$$


EXERCISE 3.6 Suppose you have observed a set of observations $\{x_i\}_{i=1}^{n}$ with $\bar x = 0$, $s_{XX} = 1$
and $n^{-1}\sum_{i=1}^{n}(x_i - \bar x)^3 = 0$. Define the variable $y_i = x_i^2$. Can you immediately tell whether
$r_{XY} \neq 0$?


EXERCISE 3.7 Find formulas (3.29) and (3.30) for $\hat\alpha$ and $\hat\beta$ by differentiating the objective
function in (3.28) w.r.t. $\alpha$ and $\beta$.


EXERCISE 3.8 How many sales does the textile manager expect with a “classic blue” pullover 
price of x = 105? 


EXERCISE 3.9 What does a scatterplot of two random variables look like for $r^2 = 1$ and
$r^2 = 0$?


EXERCISE 3.10 Prove the variance decomposition (3.38) and show that the coefficient of 
determination is the square of the simple correlation between X and Y. 


EXERCISE 3.11 Make a boxplot for the residuals $\varepsilon_i = y_i - \hat\alpha - \hat\beta x_i$ for the “classic blue”
pullovers data. If there are outliers, identify them and run the linear regression again without
them. Do you obtain a stronger influence of price on sales?


EXERCISE 3.12 Under what circumstances would you obtain the same coefficients from the 
linear regression lines of Y on X and of X on Y ? 
EXERCISE 3.13 Treat the design of Example 3.14 as if there were thirty shops and not ten.
Define $x_i$ as the index of the shop, i.e., $x_i = i$, $i = 1, 2, \ldots, 30$. The null hypothesis is a
constant regression line, $EY = \mu$. What does the alternative regression curve look like?


EXERCISE 3.14 Perform the test in Exercise 3.13 for the shop example with a 0.99
significance level. Do you still reject the hypothesis of equal marketing strategies?


EXERCISE 3.15 Compute an approximate confidence interval for $\rho_{X_1X_2}$ in Example 3.2.
Hint: start from a confidence interval for $\tanh^{-1}(\rho_{X_1X_2})$ and then apply the inverse
transformation.


EXERCISE 3.16 In Example 3.2, using the exchange rate of 1 EUR = 106 JPY, compute 
the same empirical covariance using prices in Japanese Yen rather than in Euros. Is there 
a significant difference? Why? 


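EXERCISE 3.17 Why does the correlation have the same sign as the covariance?


EXERCISE 3.18 Show that $\mathrm{rank}(\mathcal{H}) = \mathrm{tr}(\mathcal{H}) = n - 1$.


EXERCISE 3.19 Show that $\mathcal{X}_{*} = \mathcal{H}\mathcal{X}\mathcal{D}^{-1/2}$ is the standardized data matrix, i.e.,
$\bar x_{*} = 0$ and $\mathcal{S}_{\mathcal{X}_{*}} = \mathcal{R}_{\mathcal{X}}$.


EXERCISE 3.20 Compute for the pullovers data the regression of $X_1$ on $X_2, X_3$ and of $X_1$
on $X_2, X_4$. Which one has the better coefficient of determination?


EXERCISE 3.21 Compare for the pullovers data the coefficient of determination for the
regression of $X_1$ on $X_2$ (Example 3.11), of $X_1$ on $X_2, X_3$ (Exercise 3.20) and of $X_1$ on
$X_2, X_3, X_4$ (Example 3.15). Observe that this coefficient is increasing with the number of
predictor variables. Is this always the case?


EXERCISE 3.22 Consider the ANOVA problem (Section 3.5) again. Establish the constraint
matrix $\mathcal{A}$ for testing $\mu_1 = \mu_2$. Test this hypothesis via an analog of (3.55) and
(3.56).


EXERCISE 3.23 Prove (3.52). (Hint, let $f(\beta) = (y - \mathcal{X}\beta)^{\top}(y - \mathcal{X}\beta)$ and solve $\frac{\partial f(\beta)}{\partial \beta} = 0$.)


EXERCISE 3.24 Consider the linear model $Y = \mathcal{X}\beta + \varepsilon$ where $\hat\beta = \arg\min_{\beta} \varepsilon^{\top}\varepsilon$ is subject
to the linear constraints $\mathcal{A}\hat\beta = a$ where $\mathcal{A}\,(q \times p)$, $(q \le p)$ is of rank q and a is of dimension
$(q \times 1)$. Show that

$$\hat\beta = \hat\beta_{OLS} - (\mathcal{X}^{\top}\mathcal{X})^{-1}\mathcal{A}^{\top}\left(\mathcal{A}(\mathcal{X}^{\top}\mathcal{X})^{-1}\mathcal{A}^{\top}\right)^{-1}\left(\mathcal{A}\hat\beta_{OLS} - a\right)$$

where $\hat\beta_{OLS} = (\mathcal{X}^{\top}\mathcal{X})^{-1}\mathcal{X}^{\top}y$. (Hint, let $f(\beta, \lambda) = (y - \mathcal{X}\beta)^{\top}(y - \mathcal{X}\beta) - \lambda^{\top}(\mathcal{A}\beta - a)$ where $\lambda \in \mathbb{R}^q$ and solve
$\frac{\partial f(\beta,\lambda)}{\partial \beta} = 0$ and $\frac{\partial f(\beta,\lambda)}{\partial \lambda} = 0$.)


EXERCISE 3.25 Compute the covariance matrix $\mathcal{S} = \mathrm{Cov}(\mathcal{X})$ where $\mathcal{X}$ denotes the matrix
of observations on the counterfeit bank notes. Make a Jordan decomposition of $\mathcal{S}$. Why are
all of the eigenvalues positive?


EXERCISE 3.26 Compute the covariance of the counterfeit notes after they are linearly
transformed by the vector $a = (1, 1, 1, 1, 1, 1)^{\top}$.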

4 Multivariate Distributions 


The preceding chapter showed that by using the first two moments of a multivariate
distribution (the mean and the covariance matrix), a lot of information on the relationship
between the variables can be made available. Only basic statistical theory was used to
derive tests of independence or of linear relationships. In this chapter we give an introduction
to the basic probability tools useful in statistical multivariate analysis.


Means and covariances share many interesting and useful properties, but they represent 
only part of the information on a multivariate distribution. Section 4.1 presents the basic 
probability tools used to describe a multivariate random variable, including marginal and 
conditional distributions and the concept of independence. In Section 4.2, basic properties 
on means and covariances (marginal and conditional ones) are derived. 


Since many statistical procedures rely on transformations of a multivariate random variable, 
Section 4.3 proposes the basic techniques needed to derive the distribution of transformations 
with a special emphasis on linear transforms. As an important example of a multivariate 
random variable, Section 4.4 defines the multinormal distribution. It will be analyzed in 
more detail in Chapter 5 along with most of its “companion” distributions that are useful 
in making multivariate statistical inferences. 


The normal distribution plays a central role in statistics because it can be viewed as an 
approximation and limit of many other distributions. The basic justification relies on the 
central limit theorem presented in Section 4.5. We present this central theorem in the
framework of sampling theory. A useful extension of this theorem is also given: it provides an approximate
distribution for transformations of asymptotically normal variables. The increasing power of
computers today makes it possible to consider alternative approximate sampling
distributions. These are based on resampling techniques and are suitable for many general
situations. Section 4.6 gives an introduction to the ideas behind bootstrap approximations.
4.1 Distribution and Density Function


Let $X = (X_1, X_2, \ldots, X_p)^{\top}$ be a random vector. The cumulative distribution function (cdf)
of X is defined by

$$F(x) = P(X \le x) = P(X_1 \le x_1, X_2 \le x_2, \ldots, X_p \le x_p).$$

For continuous X, there exists a nonnegative probability density function (pdf) f, such that

$$F(x) = \int_{-\infty}^{x} f(u)\,du. \tag{4.1}$$

Note that

$$\int_{-\infty}^{\infty} f(u)\,du = 1.$$

Most of the integrals appearing below are multidimensional. For instance, $\int_{-\infty}^{x} f(u)\,du$ means
$\int_{-\infty}^{x_p} \cdots \int_{-\infty}^{x_1} f(u_1, \ldots, u_p)\,du_1 \cdots du_p$. Note also that the cdf F is differentiable with

$$f(x) = \frac{\partial^p F(x)}{\partial x_1 \cdots \partial x_p}.$$

For discrete X, the values of this random variable are concentrated on a countable or finite
set of points $\{c_j\}_{j \in J}$; the probability of events of the form $\{X \in D\}$ can then be computed
as

$$P(X \in D) = \sum_{\{j:\, c_j \in D\}} P(X = c_j).$$

If we partition X as $X = (X_1, X_2)^{\top}$ with $X_1 \in \mathbb{R}^k$ and $X_2 \in \mathbb{R}^{p-k}$, then the function

$$F_{X_1}(x_1) = P(X_1 \le x_1) = F(x_{11}, \ldots, x_{1k}, \infty, \ldots, \infty) \tag{4.2}$$

is called the marginal cdf. $F = F(x)$ is called the joint cdf. For continuous X the marginal
pdf can be computed from the joint density by “integrating out” the variable not of interest.

$$f_{X_1}(x_1) = \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_2. \tag{4.3}$$

The conditional pdf of $X_2$ given $X_1 = x_1$ is given as

$$f(x_2 \mid x_1) = \frac{f(x_1, x_2)}{f_{X_1}(x_1)}. \tag{4.4}$$

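EXAMPLE 4.1 Consider the pdf

$$f(x_1, x_2) = \begin{cases} \frac{1}{2}x_1 + \frac{3}{2}x_2 & 0 \le x_1, x_2 \le 1, \\ 0 & \text{otherwise.} \end{cases}$$

$f(x_1, x_2)$ is a density since

$$\int f(x_1, x_2)\,dx_1\,dx_2 = \frac{1}{2}\left[\frac{x_1^2}{2}\right]_0^1 + \frac{3}{2}\left[\frac{x_2^2}{2}\right]_0^1 = \frac{1}{4} + \frac{3}{4} = 1.$$

The marginal densities are

$$f_{X_1}(x_1) = \int f(x_1, x_2)\,dx_2 = \int_0^1 \left(\frac{1}{2}x_1 + \frac{3}{2}x_2\right) dx_2 = \frac{1}{2}x_1 + \frac{3}{4};$$

$$f_{X_2}(x_2) = \int f(x_1, x_2)\,dx_1 = \int_0^1 \left(\frac{1}{2}x_1 + \frac{3}{2}x_2\right) dx_1 = \frac{3}{2}x_2 + \frac{1}{4}.$$

The conditional densities are therefore

$$f(x_2 \mid x_1) = \frac{\frac{1}{2}x_1 + \frac{3}{2}x_2}{\frac{1}{2}x_1 + \frac{3}{4}} \quad \text{and} \quad f(x_1 \mid x_2) = \frac{\frac{1}{2}x_1 + \frac{3}{2}x_2}{\frac{3}{2}x_2 + \frac{1}{4}}.$$

Note that these conditional pdf's are nonlinear in $x_1$ and $x_2$ although the joint pdf has a
simple (linear) structure.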
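The marginal and conditional densities of Example 4.1 can be checked by numerical integration. The sketch below is an illustration only: the function names and the midpoint rule are choices of this example, not part of the text. The midpoint rule is exact (up to rounding) for integrands that are linear in the integration variable, which is the case for the joint density here.

```python
def joint_pdf(x1, x2):
    """Joint density of Example 4.1 on the unit square, zero elsewhere."""
    if 0.0 <= x1 <= 1.0 and 0.0 <= x2 <= 1.0:
        return 0.5 * x1 + 1.5 * x2
    return 0.0


def integrate(func, lower=0.0, upper=1.0, steps=200):
    """Midpoint-rule approximation of the integral of func over [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (i + 0.5) * width) for i in range(steps))


def marginal_x1(x1):
    """Integrate x2 out of the joint density, as in (4.3)."""
    return integrate(lambda x2: joint_pdf(x1, x2))


def marginal_x2(x2):
    """Integrate x1 out of the joint density."""
    return integrate(lambda x1: joint_pdf(x1, x2))


def conditional_x2_given_x1(x2, x1):
    """Conditional density f(x2 | x1) from (4.4)."""
    return joint_pdf(x1, x2) / marginal_x1(x1)


total_mass = integrate(marginal_x1)
print(total_mass, marginal_x1(0.4), marginal_x2(0.4))
```

At $x_1 = 0.4$ the numerical marginal agrees with $\frac{1}{2}x_1 + \frac{3}{4} = 0.95$, at $x_2 = 0.4$ with $\frac{3}{2}x_2 + \frac{1}{4} = 0.85$, and each conditional density integrates to one over its own argument.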


Independence of two random variables is defined as follows. 
DEFINITION 4.1 $X_1$ and $X_2$ are independent iff $f(x) = f(x_1, x_2) = f_{X_1}(x_1) f_{X_2}(x_2)$.


That is, $X_1$ and $X_2$ are independent if the conditional pdf's are equal to the marginal
densities, i.e., $f(x_1 \mid x_2) = f_{X_1}(x_1)$ and $f(x_2 \mid x_1) = f_{X_2}(x_2)$. Independence can be interpreted as
follows: knowing $X_2 = x_2$ does not change the probability assessments on $X_1$, and conversely.


Different joint pdf's may have the same marginal pdf's.


EXAMPLE 4.2 Consider the pdf's

$$f(x_1, x_2) = 1, \quad 0 < x_1, x_2 < 1,$$

and

$$f(x_1, x_2) = 1 + \alpha(2x_1 - 1)(2x_2 - 1), \quad 0 < x_1, x_2 < 1, \quad -1 \le \alpha \le 1.$$

We compute in both cases the marginal pdf's as

$$f_{X_1}(x_1) = 1, \quad f_{X_2}(x_2) = 1.$$

Indeed

$$\int_0^1 1 + \alpha(2x_1 - 1)(2x_2 - 1)\,dx_2 = 1 + \alpha(2x_1 - 1)\left[x_2^2 - x_2\right]_0^1 = 1.$$

Hence we obtain identical marginals from different joint distributions!
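The statement of Example 4.2 can also be verified numerically. The following sketch is illustrative (function names and the midpoint rule are assumptions of this example): it integrates the second joint density over $x_2$ for several values of $\alpha$ and shows that the marginal is always equal to one, although the joint density itself changes with $\alpha$.

```python
def joint_pdf(x1, x2, alpha):
    """Joint density 1 + alpha (2 x1 - 1)(2 x2 - 1) on the unit square."""
    return 1.0 + alpha * (2.0 * x1 - 1.0) * (2.0 * x2 - 1.0)


def marginal_x1(x1, alpha, steps=200):
    """Midpoint-rule integral of the joint density over x2 in (0, 1)."""
    width = 1.0 / steps
    return width * sum(joint_pdf(x1, (i + 0.5) * width, alpha) for i in range(steps))


# alpha = 0 gives the uniform density; the other values give different joint densities.
for alpha in (-1.0, 0.0, 0.5, 1.0):
    marginals = [round(marginal_x1(x1, alpha), 10) for x1 in (0.1, 0.5, 0.9)]
    print(alpha, joint_pdf(0.9, 0.9, alpha), marginals)
```

For every $\alpha$ the marginal values are 1.0, while the joint density at the point (0.9, 0.9) varies with $\alpha$.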


[Figure: two panels, each titled “Swiss bank notes”, with horizontal axes “lower inner frame (X4)” and “upper inner frame (X5)”.]

Figure 4.1. Univariate estimates of the density of $X_4$ (left) and $X_5$ (right)
of the bank notes. MVAdenbank2.xpl


Let us study the concept of independence using the bank notes example. Consider the
variables $X_4$ (lower inner frame) and $X_5$ (upper inner frame). From Chapter 3, we already
know that they have significant correlation, so they are almost surely not independent.
Kernel estimates of the marginal densities, $\hat f_{X_4}$ and $\hat f_{X_5}$, are given in Figure 4.1. In Figure
4.2 (left) we show the product of these two densities. The kernel density technique was
presented in Section 1.3. If $X_4$ and $X_5$ are independent, this product $\hat f_{X_4} \cdot \hat f_{X_5}$ should be
roughly equal to $\hat f(x_4, x_5)$, the estimate of the joint density of $(X_4, X_5)$. Comparing the two
graphs in Figure 4.2 reveals that the two densities are different. The two variables $X_4$ and
$X_5$ are therefore not independent.


An elegant concept of connecting marginals with joint cdfs is given by copulas. Copulas
are important in Value-at-Risk calculations and are an essential tool in quantitative finance
(Härdle, Kleinow and Stahl, 2002).


For simplicity of presentation we concentrate on the p = 2 dimensional case. A 2-dimensional
copula is a function $C: [0,1]^2 \to [0,1]$ with the following properties:


• For every $u \in [0,1]$: $C(0, u) = C(u, 0) = 0$.
• For every $u \in [0,1]$: $C(u, 1) = u$ and $C(1, u) = u$.
• For every $(u_1, u_2), (v_1, v_2) \in [0,1] \times [0,1]$ with $u_1 \le v_1$ and $u_2 \le v_2$:

$$C(v_1, v_2) - C(v_1, u_2) - C(u_1, v_2) + C(u_1, u_2) \ge 0.$$
[Figure: two panels, each titled “Swiss bank notes”.]

Figure 4.2. The product of univariate density estimates (left) and the
joint density estimate (right) for $X_4$ and $X_5$ of the bank notes.
MVAdenbank3.xpl


The usage of the name “copula” for the function C is explained by the following theorem.


THEOREM 4.1 (Sklar's theorem) Let F be a joint distribution function with marginal
distribution functions $F_{X_1}$ and $F_{X_2}$. Then there exists a copula C with

$$F(x_1, x_2) = C\{F_{X_1}(x_1), F_{X_2}(x_2)\} \tag{4.5}$$

for every $x_1, x_2 \in \mathbb{R}$. If $F_{X_1}$ and $F_{X_2}$ are continuous, then C is unique. On the other hand,
if C is a copula and $F_{X_1}$ and $F_{X_2}$ are distribution functions, then the function F defined
by (4.5) is a joint distribution function with marginals $F_{X_1}$ and $F_{X_2}$.


With Sklar's Theorem, the use of the name “copula” becomes obvious. It was chosen to
describe “a function that links a multidimensional distribution to its one-dimensional margins”
and appeared in the mathematical literature for the first time in Sklar (1959).


EXAMPLE 4.3 The structure of independence implies that the product of the distribution
functions $F_{X_1}$ and $F_{X_2}$ equals their joint distribution function F,

$$F(x_1, x_2) = F_{X_1}(x_1) \cdot F_{X_2}(x_2). \tag{4.6}$$

Thus, we obtain the independence copula $C = \Pi$ from

$$\Pi(u_1, \ldots, u_n) = \prod_{i=1}^{n} u_i.$$
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THEOREM 4.2 Let X, and X2 be random variables with continuous distribution functions 
Fy, and Fx, and the joint distribution function F. Then X, and X2 are independent if and 
only if Cx, x, = IL 


Proof: 
From Sklar’s Theorem we know that there exists an unique copula C’ with 


P(X, << %1, X92 — £2) = P(g aia) => C{F x, (21), Fx, (x2)} . (4.7) 


Independence can be seen using (4.5) for the joint distribution function F’ and the definition 
of II, 
F(a, £2) = CLPx, (#1), x2 (2) } = Fx (@1) x2 (#2) - (4.8) 


O 


EXAMPLE 4.4 The Gumbel-Hougaard family of copulas (Nelsen, 1999) is given by the function

$$C_\theta(u,v) = \exp\left\{-\left[(-\ln u)^\theta + (-\ln v)^\theta\right]^{1/\theta}\right\}. \qquad (4.9)$$

The parameter $\theta$ may take all values in the interval $[1,\infty)$. The Gumbel-Hougaard copulas
are suited to describe bivariate extreme value distributions.

For $\theta = 1$, the expression (4.9) reduces to the product copula, i.e., $C_1(u,v) = \Pi(u,v) = uv$.
For $\theta \to \infty$ one finds for the Gumbel-Hougaard copula:

$$C_\theta(u,v) \longrightarrow \min(u,v) = M(u,v),$$

where the function $M$ is also a copula such that $C(u,v) \le M(u,v)$ for an arbitrary copula $C$.
The copula $M$ is called the Fréchet-Hoeffding upper bound.

Similarly, we obtain the Fréchet-Hoeffding lower bound $W(u,v) = \max(u + v - 1, 0)$ which
satisfies $W(u,v) \le C(u,v)$ for any other copula $C$.
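The limiting behaviour of the Gumbel-Hougaard family can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function names and the evaluation point $(u,v) = (0.3, 0.7)$ are illustrative assumptions.

```python
import math

def gumbel_hougaard(u, v, theta):
    """Gumbel-Hougaard copula (4.9) for 0 < u, v < 1 and theta >= 1."""
    if theta < 1:
        raise ValueError("theta must lie in [1, inf)")
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-s ** (1.0 / theta))

def upper_bound(u, v):
    """Frechet-Hoeffding upper bound M(u, v)."""
    return min(u, v)

def lower_bound(u, v):
    """Frechet-Hoeffding lower bound W(u, v)."""
    return max(u + v - 1.0, 0.0)

u, v = 0.3, 0.7  # hypothetical evaluation point
for theta in (1.0, 2.0, 10.0, 100.0):
    # theta = 1 gives the product u*v; large theta approaches min(u, v)
    print(theta, gumbel_hougaard(u, v, theta))
```

For every $\theta$ in the loop the value stays between $W(u,v)$ and $M(u,v)$, as the Fréchet-Hoeffding bounds require.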


Summary

↪ The cumulative distribution function (cdf) is defined as $F(x) = P(X \le x)$.

↪ If a probability density function (pdf) $f$ exists then $F(x) = \int_{-\infty}^{x} f(u)\,du$.

↪ The pdf integrates to one, i.e., $\int_{-\infty}^{\infty} f(x)\,dx = 1$.

↪ Let $X = (X_1, X_2)^\top$ be partitioned into sub-vectors $X_1$ and $X_2$ with joint
cdf $F$. Then $F_{X_1}(x_1) = P(X_1 \le x_1)$ is the marginal cdf of $X_1$. The
marginal pdf of $X_1$ is obtained by $f_{X_1}(x_1) = \int_{-\infty}^{\infty} f(x_1,x_2)\,dx_2$. Different
joint pdf's may have the same marginal pdf's.

↪ The conditional pdf of $X_2$ given $X_1 = x_1$ is defined as $f(x_2 \mid x_1) = \dfrac{f(x_1,x_2)}{f_{X_1}(x_1)}$.

↪ Two random variables $X_1$ and $X_2$ are called independent iff
$f(x_1,x_2) = f_{X_1}(x_1) f_{X_2}(x_2)$. This is equivalent to $f(x_2 \mid x_1) = f_{X_2}(x_2)$.


4.2 Moments and Characteristic Functions 


Moments—Expectation and Covariance Matrix 


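If $X$ is a random vector with density $f(x)$ then the expectation of $X$ is

$$EX = \begin{pmatrix} EX_1 \\ \vdots \\ EX_p \end{pmatrix} = \int x f(x)\,dx = \begin{pmatrix} \int x_1 f(x)\,dx \\ \vdots \\ \int x_p f(x)\,dx \end{pmatrix} = \mu. \qquad (4.10)$$

Accordingly, the expectation of a matrix of random elements has to be understood component
by component. The operation of forming expectations is linear:

$$E(\alpha X + \beta Y) = \alpha EX + \beta EY. \qquad (4.11)$$

If $\mathcal{A}(q \times p)$ is a matrix of real numbers, we have:

$$E(\mathcal{A}X) = \mathcal{A}\,EX. \qquad (4.12)$$

When $X$ and $Y$ are independent,

$$E(XY^\top) = EX\,EY^\top. \qquad (4.13)$$

The matrix

$$\mathrm{Var}(X) = \Sigma = E(X - \mu)(X - \mu)^\top \qquad (4.14)$$

is the (theoretical) covariance matrix. We write for a vector $X$ with mean vector $\mu$ and
covariance matrix $\Sigma$,

$$X \sim (\mu, \Sigma). \qquad (4.15)$$

The $(p \times q)$ matrix

$$\Sigma_{XY} = \mathrm{Cov}(X,Y) = E(X - \mu)(Y - \nu)^\top \qquad (4.16)$$

is the covariance matrix of $X \sim (\mu, \Sigma_{XX})$ and $Y \sim (\nu, \Sigma_{YY})$. Note that $\Sigma_{XY} = \Sigma_{YX}^\top$ and
that $Z = \begin{pmatrix} X \\ Y \end{pmatrix}$ has covariance $\Sigma_{ZZ} = \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}$. From

$$\mathrm{Cov}(X,Y) = E(XY^\top) - \mu\nu^\top = E(XY^\top) - EX\,EY^\top \qquad (4.17)$$

it follows that $\mathrm{Cov}(X,Y) = 0$ in the case where $X$ and $Y$ are independent. We often say
that $\mu = E(X)$ is the first order moment of $X$ and that $E(XX^\top)$ provides the second order
moments of $X$:

$$E(XX^\top) = \{E(X_iX_j)\}, \quad \text{for } i = 1,\dots,p \text{ and } j = 1,\dots,p. \qquad (4.18)$$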


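Properties of the Covariance Matrix $\Sigma = \mathrm{Var}(X)$

$$\Sigma = (\sigma_{X_iX_j}), \quad \sigma_{X_iX_j} = \mathrm{Cov}(X_i, X_j), \quad \sigma_{X_iX_i} = \mathrm{Var}(X_i) \qquad (4.19)$$
$$\Sigma = E(XX^\top) - \mu\mu^\top \qquad (4.20)$$
$$\Sigma \ge 0 \qquad (4.21)$$

Properties of Variances and Covariances

$$\mathrm{Var}(a^\top X) = a^\top \mathrm{Var}(X)\,a = \sum_{i,j} a_i a_j \sigma_{X_iX_j} \qquad (4.22)$$
$$\mathrm{Var}(\mathcal{A}X + b) = \mathcal{A}\,\mathrm{Var}(X)\,\mathcal{A}^\top \qquad (4.23)$$
$$\mathrm{Cov}(X + Y, Z) = \mathrm{Cov}(X, Z) + \mathrm{Cov}(Y, Z) \qquad (4.24)$$
$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Cov}(X, Y) + \mathrm{Cov}(Y, X) + \mathrm{Var}(Y) \qquad (4.25)$$
$$\mathrm{Cov}(\mathcal{A}X, \mathcal{B}Y) = \mathcal{A}\,\mathrm{Cov}(X,Y)\,\mathcal{B}^\top. \qquad (4.26)$$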


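Let us compute these quantities for a specific joint density.

EXAMPLE 4.5 Consider the pdf of Example 4.1, $f(x_1,x_2) = \frac{1}{2}x_1 + \frac{3}{2}x_2$ for $0 \le x_1, x_2 \le 1$. The mean vector $\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ is

$$\mu_1 = \int\!\!\int x_1 f(x_1,x_2)\,dx_1dx_2 = \int_0^1\!\!\int_0^1 x_1\left(\tfrac{1}{2}x_1 + \tfrac{3}{2}x_2\right)dx_1dx_2 = \int_0^1 x_1\left(\tfrac{1}{2}x_1 + \tfrac{3}{4}\right)dx_1 = \tfrac{1}{2}\left[\tfrac{x_1^3}{3}\right]_0^1 + \tfrac{3}{4}\left[\tfrac{x_1^2}{2}\right]_0^1 = \tfrac{1}{6} + \tfrac{3}{8} = \tfrac{4+9}{24} = \tfrac{13}{24},$$

$$\mu_2 = \int\!\!\int x_2 f(x_1,x_2)\,dx_1dx_2 = \int_0^1\!\!\int_0^1 x_2\left(\tfrac{1}{2}x_1 + \tfrac{3}{2}x_2\right)dx_1dx_2 = \int_0^1 x_2\left(\tfrac{1}{4} + \tfrac{3}{2}x_2\right)dx_2 = \tfrac{1}{4}\left[\tfrac{x_2^2}{2}\right]_0^1 + \tfrac{3}{2}\left[\tfrac{x_2^3}{3}\right]_0^1 = \tfrac{1}{8} + \tfrac{1}{2} = \tfrac{1+4}{8} = \tfrac{5}{8}.$$

The elements of the covariance matrix are

$$\sigma_{X_1X_1} = EX_1^2 - \mu_1^2 \quad\text{with}\quad EX_1^2 = \int_0^1\!\!\int_0^1 x_1^2\left(\tfrac{1}{2}x_1 + \tfrac{3}{2}x_2\right)dx_1dx_2 = \tfrac{1}{2}\left[\tfrac{x_1^4}{4}\right]_0^1 + \tfrac{3}{4}\left[\tfrac{x_1^3}{3}\right]_0^1 = \tfrac{3}{8},$$

$$\sigma_{X_2X_2} = EX_2^2 - \mu_2^2 \quad\text{with}\quad EX_2^2 = \int_0^1\!\!\int_0^1 x_2^2\left(\tfrac{1}{2}x_1 + \tfrac{3}{2}x_2\right)dx_1dx_2 = \tfrac{1}{4}\left[\tfrac{x_2^3}{3}\right]_0^1 + \tfrac{3}{2}\left[\tfrac{x_2^4}{4}\right]_0^1 = \tfrac{11}{24},$$

$$\sigma_{X_1X_2} = E(X_1X_2) - \mu_1\mu_2 \quad\text{with}\quad E(X_1X_2) = \int_0^1\!\!\int_0^1 x_1x_2\left(\tfrac{1}{2}x_1 + \tfrac{3}{2}x_2\right)dx_1dx_2 = \int_0^1\left(\tfrac{1}{6}x_2 + \tfrac{3}{4}x_2^2\right)dx_2 = \tfrac{1}{6}\left[\tfrac{x_2^2}{2}\right]_0^1 + \tfrac{3}{4}\left[\tfrac{x_2^3}{3}\right]_0^1 = \tfrac{1}{3}.$$

Hence $\sigma_{X_1X_1} = \tfrac{3}{8} - \tfrac{169}{576} = \tfrac{47}{576}$, $\sigma_{X_2X_2} = \tfrac{11}{24} - \tfrac{25}{64} = \tfrac{13}{192}$ and $\sigma_{X_1X_2} = \tfrac{1}{3} - \tfrac{65}{192} = -\tfrac{1}{192}$, so the covariance matrix is

$$\Sigma = \begin{pmatrix} 0.0816 & -0.0052 \\ -0.0052 & 0.0677 \end{pmatrix}.$$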
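Because the density of Example 4.5 is a polynomial on the unit square, its moments can be reproduced with exact rational arithmetic. The sketch below is an illustration only; the dictionary representation of the pdf and the helper name `moment` are assumptions made for this example. It gives $\mu = (13/24, 5/8)^\top$, variances $47/576$ and $13/192$, and the off-diagonal entry $-1/192 \approx -0.0052$.

```python
from fractions import Fraction as Fr

# pdf of Example 4.1 on [0,1]^2: f(x1, x2) = 1/2 * x1 + 3/2 * x2,
# stored as {(power of x1, power of x2): coefficient}.
PDF = {(1, 0): Fr(1, 2), (0, 1): Fr(3, 2)}

def moment(a, b, pdf=PDF):
    """E(X1^a * X2^b): each monomial is integrated exactly over the unit square."""
    return sum(c / ((i + a + 1) * (j + b + 1)) for (i, j), c in pdf.items())

mu1, mu2 = moment(1, 0), moment(0, 1)
var1 = moment(2, 0) - mu1 ** 2      # sigma_{X1 X1}
var2 = moment(0, 2) - mu2 ** 2      # sigma_{X2 X2}
cov12 = moment(1, 1) - mu1 * mu2    # sigma_{X1 X2}

print(mu1, mu2)
print(var1, cov12, var2)
```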
Conditional Expectations

The conditional expectations are

$$E(X_2 \mid x_1) = \int x_2 f(x_2 \mid x_1)\,dx_2 \quad\text{and}\quad E(X_1 \mid x_2) = \int x_1 f(x_1 \mid x_2)\,dx_1. \qquad (4.27)$$

$E(X_2 \mid x_1)$ represents the location parameter of the conditional pdf of $X_2$ given that $X_1 = x_1$.
In the same way, we can define $\mathrm{Var}(X_2 \mid X_1 = x_1)$ as a measure of the dispersion of $X_2$ given
that $X_1 = x_1$. We have from (4.20) that

$$\mathrm{Var}(X_2 \mid X_1 = x_1) = E(X_2X_2^\top \mid X_1 = x_1) - E(X_2 \mid X_1 = x_1)\,E(X_2^\top \mid X_1 = x_1).$$

Using the conditional covariance matrix, the conditional correlations may be defined as:

$$\rho_{X_2X_3 \mid X_1 = x_1} = \frac{\mathrm{Cov}(X_2, X_3 \mid X_1 = x_1)}{\sqrt{\mathrm{Var}(X_2 \mid X_1 = x_1)\,\mathrm{Var}(X_3 \mid X_1 = x_1)}}.$$

These conditional correlations are known as partial correlations between $X_2$ and $X_3$, conditioned
on $X_1$ being equal to $x_1$.


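EXAMPLE 4.6 Consider the following pdf

$$f(x_1,x_2,x_3) = \tfrac{2}{3}(x_1 + x_2 + x_3) \quad\text{where } 0 < x_1, x_2, x_3 < 1.$$

Note that the pdf is symmetric in $x_1$, $x_2$ and $x_3$ which facilitates the computations. For
instance,

$$f(x_1,x_2) = \tfrac{2}{3}\left(x_1 + x_2 + \tfrac{1}{2}\right), \quad 0 < x_1, x_2 < 1,$$
$$f(x_1) = \tfrac{2}{3}(x_1 + 1), \quad 0 < x_1 < 1,$$

and the other marginals are similar. We also have

$$f(x_1,x_2 \mid x_3) = \frac{x_1 + x_2 + x_3}{x_3 + 1}, \quad 0 < x_1, x_2 < 1,$$
$$f(x_1 \mid x_3) = \frac{x_1 + x_3 + \frac{1}{2}}{x_3 + 1}, \quad 0 < x_1 < 1.$$

It is easy to compute the following moments:

$$E(X_i) = \tfrac{5}{9}; \quad E(X_i^2) = \tfrac{7}{18}; \quad E(X_iX_j) = \tfrac{11}{36} \quad (i \ne j \text{ and } i,j = 1,2,3),$$

$$E(X_1 \mid X_3 = x_3) = E(X_2 \mid X_3 = x_3) = \frac{1}{12}\left(\frac{6x_3 + 7}{x_3 + 1}\right),$$

$$E(X_1^2 \mid X_3 = x_3) = E(X_2^2 \mid X_3 = x_3) = \frac{1}{12}\left(\frac{4x_3 + 5}{x_3 + 1}\right)$$

and

$$E(X_1X_2 \mid X_3 = x_3) = \frac{1}{12}\left(\frac{3x_3 + 4}{x_3 + 1}\right).$$

Note that the conditional means of $X_1$ and of $X_2$, given $X_3 = x_3$, are not linear in $x_3$. From
these moments we obtain:

$$\Sigma = \begin{pmatrix} \frac{13}{162} & -\frac{1}{324} & -\frac{1}{324} \\ -\frac{1}{324} & \frac{13}{162} & -\frac{1}{324} \\ -\frac{1}{324} & -\frac{1}{324} & \frac{13}{162} \end{pmatrix}, \quad\text{in particular } \rho_{X_1X_2} = -\tfrac{1}{26} \approx -0.0385.$$

The conditional covariance matrix of $X_1$ and $X_2$, given $X_3 = x_3$, is

$$\mathrm{Var}\left(\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \,\Big|\, X_3 = x_3\right) = \begin{pmatrix} \dfrac{12x_3^2 + 24x_3 + 11}{144(x_3+1)^2} & \dfrac{-1}{144(x_3+1)^2} \\ \dfrac{-1}{144(x_3+1)^2} & \dfrac{12x_3^2 + 24x_3 + 11}{144(x_3+1)^2} \end{pmatrix}.$$

In particular, the partial correlation between $X_1$ and $X_2$, given that $X_3$ is fixed at $x_3$, is given
by $\rho_{X_1X_2 \mid X_3 = x_3} = -\dfrac{1}{12x_3^2 + 24x_3 + 11}$, which ranges from $-0.0909$ to $-0.0213$ when $x_3$ goes from 0
to 1. Therefore, in this example, the partial correlation may be larger or smaller than the
simple correlation, depending on the value of the condition $X_3 = x_3$.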


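EXAMPLE 4.7 Consider the following joint pdf

$$f(x_1,x_2,x_3) = 2x_2(x_1 + x_3); \quad 0 < x_1, x_2, x_3 < 1.$$

Note the symmetry of $x_1$ and $x_3$ in the pdf and that $X_2$ is independent of $(X_1, X_3)$. It
immediately follows that

$$f(x_1,x_3) = (x_1 + x_3), \quad 0 < x_1, x_3 < 1,$$
$$f(x_1) = x_1 + \tfrac{1}{2}; \quad f(x_2) = 2x_2; \quad f(x_3) = x_3 + \tfrac{1}{2}.$$

Simple computations lead to

$$E(X) = \begin{pmatrix} \frac{7}{12} \\ \frac{2}{3} \\ \frac{7}{12} \end{pmatrix} \quad\text{and}\quad \Sigma = \begin{pmatrix} \frac{11}{144} & 0 & -\frac{1}{144} \\ 0 & \frac{1}{18} & 0 \\ -\frac{1}{144} & 0 & \frac{11}{144} \end{pmatrix}.$$

Let us analyze the conditional distribution of $(X_1, X_2)$ given $X_3 = x_3$. We have

$$f(x_1,x_2 \mid x_3) = \frac{4(x_1 + x_3)x_2}{2x_3 + 1}, \quad 0 < x_1, x_2 < 1,$$
$$f(x_1 \mid x_3) = 2\left(\frac{x_1 + x_3}{2x_3 + 1}\right), \quad 0 < x_1 < 1,$$
$$f(x_2 \mid x_3) = f(x_2) = 2x_2, \quad 0 < x_2 < 1,$$

so that again $X_1$ and $X_2$ are independent conditional on $X_3 = x_3$. In this case

$$E\left(\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \,\Big|\, X_3 = x_3\right) = \begin{pmatrix} \frac{1}{3}\left(\frac{2 + 3x_3}{1 + 2x_3}\right) \\ \frac{2}{3} \end{pmatrix},$$

$$\mathrm{Var}\left(\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \,\Big|\, X_3 = x_3\right) = \begin{pmatrix} \frac{1}{18}\left(\frac{6x_3^2 + 6x_3 + 1}{(2x_3 + 1)^2}\right) & 0 \\ 0 & \frac{1}{18} \end{pmatrix}.$$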
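Properties of Conditional Expectations

Since $E(X_2 \mid X_1 = x_1)$ is a function of $x_1$, say $h(x_1)$, we can define the random variable
$h(X_1) = E(X_2 \mid X_1)$. The same can be done when defining the random variable $\mathrm{Var}(X_2 \mid X_1)$.
These two random variables share some interesting properties:

$$E(X_2) = E\{E(X_2 \mid X_1)\} \qquad (4.28)$$
$$\mathrm{Var}(X_2) = E\{\mathrm{Var}(X_2 \mid X_1)\} + \mathrm{Var}\{E(X_2 \mid X_1)\}. \qquad (4.29)$$

EXAMPLE 4.8 Consider the following pdf

$$f(x_1,x_2) = 2e^{-x_2/x_1}; \quad 0 < x_1 < 1,\ x_2 > 0.$$

It is easy to show that

$$f(x_1) = 2x_1 \text{ for } 0 < x_1 < 1; \quad E(X_1) = \tfrac{2}{3} \quad\text{and}\quad \mathrm{Var}(X_1) = \tfrac{1}{18},$$

$$f(x_2 \mid x_1) = \frac{1}{x_1}e^{-x_2/x_1} \text{ for } x_2 > 0; \quad E(X_2 \mid X_1) = X_1 \quad\text{and}\quad \mathrm{Var}(X_2 \mid X_1) = X_1^2.$$

Without explicitly computing $f(x_2)$, we can obtain:

$$E(X_2) = E\{E(X_2 \mid X_1)\} = E(X_1) = \tfrac{2}{3},$$
$$\mathrm{Var}(X_2) = E\{\mathrm{Var}(X_2 \mid X_1)\} + \mathrm{Var}\{E(X_2 \mid X_1)\} = E(X_1^2) + \mathrm{Var}(X_1) = \tfrac{2}{4} + \tfrac{1}{18} = \tfrac{10}{18}.$$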
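The two values obtained from (4.28) and (4.29) in Example 4.8 can be cross-checked by simulation. The following Python sketch is an illustration under stated assumptions: the sampling scheme (inverse-cdf draw for $X_1$, exponential draw for $X_2$ given $X_1$), the sample size and the seed are choices made here, not part of the original text.

```python
import math
import random
import statistics

def simulate_x2(n, seed=0):
    """Draw X2 from f(x1, x2) = 2 exp(-x2/x1), 0 < x1 < 1, x2 > 0."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # marginal of X1 has cdf x1^2 on (0, 1], so X1 = sqrt(U)
        x1 = math.sqrt(1.0 - rng.random())
        # X2 | X1 = x1 is exponential with mean x1 (rate 1/x1)
        draws.append(rng.expovariate(1.0 / x1))
    return draws

x2 = simulate_x2(100_000)
print(statistics.fmean(x2))       # compare with E(X2) = 2/3
print(statistics.pvariance(x2))   # compare with Var(X2) = 10/18
```

The simulated mean and variance should lie close to $2/3$ and $10/18$, up to Monte Carlo error.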


The conditional expectation $E(X_2 \mid X_1)$ viewed as a function $h(X_1)$ of $X_1$ (known as the
regression function of $X_2$ on $X_1$), can be interpreted as a conditional approximation of $X_2$
by a function of $X_1$. The error term of the approximation is then given by:

$$U = X_2 - E(X_2 \mid X_1).$$

THEOREM 4.3 Let $X_1 \in \mathbb{R}^k$ and $X_2 \in \mathbb{R}^{p-k}$ and $U = X_2 - E(X_2 \mid X_1)$. Then we have:

(1) $E(U) = 0$

(2) $E(X_2 \mid X_1)$ is the best approximation of $X_2$ by a function $h(X_1)$ of $X_1$ where $h: \mathbb{R}^k \to \mathbb{R}^{p-k}$.
"Best" is the minimum mean squared error (MSE), where

$$MSE(h) = E[\{X_2 - h(X_1)\}^\top \{X_2 - h(X_1)\}].$$
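Characteristic Functions

The characteristic function (cf) of a random vector $X \in \mathbb{R}^p$ (respectively its density $f(x)$)
is defined as

$$\varphi_X(t) = E(e^{\mathrm{i}t^\top X}) = \int e^{\mathrm{i}t^\top x} f(x)\,dx, \quad t \in \mathbb{R}^p,$$

where $\mathrm{i}$ is the complex unit: $\mathrm{i}^2 = -1$. The cf has the following properties:

$$\varphi_X(0) = 1 \quad\text{and}\quad |\varphi_X(t)| \le 1. \qquad (4.30)$$

If $\varphi$ is absolutely integrable, i.e., the integral $\int_{-\infty}^{\infty} |\varphi(x)|\,dx$ exists and is finite, then

$$f(x) = \frac{1}{(2\pi)^p}\int_{-\infty}^{\infty} e^{-\mathrm{i}t^\top x}\varphi_X(t)\,dt. \qquad (4.31)$$

If $X = (X_1, X_2, \dots, X_p)^\top$, then for $t = (t_1, t_2, \dots, t_p)^\top$

$$\varphi_{X_1}(t_1) = \varphi_X(t_1, 0, \dots, 0), \quad \dots, \quad \varphi_{X_p}(t_p) = \varphi_X(0, \dots, 0, t_p). \qquad (4.32)$$

If $X_1, \dots, X_p$ are independent random variables, then for $t = (t_1, t_2, \dots, t_p)^\top$

$$\varphi_X(t) = \varphi_{X_1}(t_1) \cdot \ldots \cdot \varphi_{X_p}(t_p). \qquad (4.33)$$

If $X_1, \dots, X_p$ are independent random variables, then for $t \in \mathbb{R}$

$$\varphi_{X_1 + \dots + X_p}(t) = \varphi_{X_1}(t) \cdot \ldots \cdot \varphi_{X_p}(t). \qquad (4.34)$$

The characteristic function can recover all the cross-product moments of any order: for all $j_k \ge 0$,
$k = 1, \dots, p$ and for $t = (t_1, \dots, t_p)^\top$ we have

$$E\left(X_1^{j_1} \cdot \ldots \cdot X_p^{j_p}\right) = \frac{1}{\mathrm{i}^{j_1 + \dots + j_p}}\left[\frac{\partial^{\,j_1 + \dots + j_p}\varphi_X(t)}{\partial t_1^{j_1} \cdots \partial t_p^{j_p}}\right]_{t=0}. \qquad (4.35)$$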


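EXAMPLE 4.9 The cf of the density in Example 4.5 is given by

$$\varphi_X(t) = \int_0^1\!\!\int_0^1 e^{\mathrm{i}t^\top x} f(x)\,dx = \int_0^1\!\!\int_0^1 \{\cos(t_1x_1 + t_2x_2) + \mathrm{i}\sin(t_1x_1 + t_2x_2)\}\left(\tfrac{1}{2}x_1 + \tfrac{3}{2}x_2\right)dx_1dx_2,$$

$$\varphi_X(t) = \frac{0.5\,e^{\mathrm{i}t_1}\left(3\mathrm{i}t_1 - 3\mathrm{i}e^{\mathrm{i}t_2}t_1 + \mathrm{i}t_2 - \mathrm{i}e^{\mathrm{i}t_2}t_2 + t_1t_2 - 4e^{\mathrm{i}t_2}t_1t_2\right)}{t_1^2t_2^2} - \frac{0.5\left(3\mathrm{i}t_1 - 3\mathrm{i}e^{\mathrm{i}t_2}t_1 + \mathrm{i}t_2 - \mathrm{i}e^{\mathrm{i}t_2}t_2 - 3e^{\mathrm{i}t_2}t_1t_2\right)}{t_1^2t_2^2}.$$

Table 4.2. Characteristic functions for some common distributions.

Uniform
    pdf: $f(x) = I(x \in [a,b])/(b - a)$
    cf:  $\varphi_X(t) = (e^{\mathrm{i}bt} - e^{\mathrm{i}at})/\{(b - a)\mathrm{i}t\}$

$N_1(\mu, \sigma^2)$
    pdf: $f(x) = (2\pi\sigma^2)^{-1/2}\exp\{-(x - \mu)^2/2\sigma^2\}$
    cf:  $\varphi_X(t) = e^{\mathrm{i}\mu t - \sigma^2t^2/2}$

$\chi^2(n)$
    pdf: $f(x) = I(x > 0)\,x^{n/2 - 1}e^{-x/2}/\{\Gamma(n/2)2^{n/2}\}$
    cf:  $\varphi_X(t) = (1 - 2\mathrm{i}t)^{-n/2}$

$N_p(\mu, \Sigma)$
    pdf: $f(x) = |2\pi\Sigma|^{-1/2}\exp\{-(x - \mu)^\top\Sigma^{-1}(x - \mu)/2\}$
    cf:  $\varphi_X(t) = e^{\mathrm{i}t^\top\mu - t^\top\Sigma t/2}$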


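EXAMPLE 4.10 Suppose $X \in \mathbb{R}^1$ follows the density of the standard normal distribution

$$f_X(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)$$

(see Section 4.4) then the cf can be computed via

$$\varphi_X(t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{\mathrm{i}tx}\exp\left(-\frac{x^2}{2}\right)dx$$
$$= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\exp\left\{-\frac{1}{2}\left(x^2 - 2\mathrm{i}tx + \mathrm{i}^2t^2\right)\right\}\exp\left\{\frac{1}{2}\mathrm{i}^2t^2\right\}dx$$
$$= \exp\left(-\frac{t^2}{2}\right)\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{(x - \mathrm{i}t)^2}{2}\right\}dx$$
$$= \exp\left(-\frac{t^2}{2}\right),$$

since $\mathrm{i}^2 = -1$ and $\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{(x - \mathrm{i}t)^2}{2}\right\}dx = 1$.

A variety of distributional characteristics can be computed from $\varphi_X(t)$. The standard normal
distribution has a very simple cf, as was seen in Example 4.10. Deviations from normal
covariance structures can be measured by the deviations from the cf (or characteristics of
it). In Table 4.2 we give an overview of the cf's for a variety of distributions.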
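The closed-form cf's of the uniform and the univariate normal distribution can be compared with a direct numerical evaluation of $E(e^{\mathrm{i}tX}) = \int e^{\mathrm{i}tx}f(x)\,dx$. The Python sketch below uses a simple midpoint rule; the function names, integration limits and parameter values are assumptions chosen for illustration.

```python
import cmath
import math

def cf_numeric(pdf, t, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of exp(i t x) f(x) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0j
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += cmath.exp(1j * t * x) * pdf(x)
    return total * h

def cf_uniform(t, a, b):
    """cf of the uniform distribution on [a, b] (t != 0)."""
    return (cmath.exp(1j * b * t) - cmath.exp(1j * a * t)) / ((b - a) * 1j * t)

def cf_normal(t, mu, sigma2):
    """cf of N(mu, sigma2)."""
    return cmath.exp(1j * mu * t - sigma2 * t * t / 2.0)

a, b, mu, sigma2 = 2.0, 5.0, 1.0, 4.0   # hypothetical parameter values
uniform_pdf = lambda x: 1.0 / (b - a)
normal_pdf = lambda x: math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

print(cf_numeric(uniform_pdf, 1.3, a, b), cf_uniform(1.3, a, b))
# the normal density is negligible beyond 12 standard deviations from mu
print(cf_numeric(normal_pdf, 0.7, mu - 24.0, mu + 24.0), cf_normal(0.7, mu, sigma2))
```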


THEOREM 4.4 (Cramér-Wold) The distribution of $X \in \mathbb{R}^p$ is completely determined by
the set of all (one-dimensional) distributions of $t^\top X$ where $t \in \mathbb{R}^p$.

This theorem says that we can determine the distribution of $X$ in $\mathbb{R}^p$ by specifying all of the
one-dimensional distributions of the linear combinations

$$\sum_{j=1}^{p} t_jX_j = t^\top X, \quad t = (t_1, t_2, \dots, t_p)^\top.$$
Cumulant functions

Moments $m_k = \int x^kf(x)\,dx$ often help in describing distributional characteristics. The normal
distribution in $d = 1$ dimension is completely characterized by its standard normal
density $f = \varphi$ and the moment parameters are $\mu = m_1$ and $\sigma^2 = m_2 - m_1^2$. Another
helpful class of parameters are the cumulants or semi-invariants of a distribution. In order
to simplify notation we concentrate here on the one-dimensional ($d = 1$) case.

For a given random variable $X$ with density $f$ and finite moments of order $k$ the characteristic
function $\varphi_X(t) = E(e^{\mathrm{i}tX})$ has the derivative

$$\frac{1}{\mathrm{i}^j}\left[\frac{\partial^j\log\{\varphi_X(t)\}}{\partial t^j}\right]_{t=0} = \kappa_j, \quad j = 1, \dots, k.$$

The values $\kappa_j$ are called cumulants or semi-invariants since $\kappa_j$ does not change (for $j > 1$)
under a shift transformation $X \mapsto X + a$. The cumulants are natural parameters for
dimension reduction methods, in particular the Projection Pursuit method (see Section 18.2).

The relationship between the first $k$ moments $m_1, \dots, m_k$ and the cumulants is given by

$$\kappa_k = (-1)^{k-1}\begin{vmatrix} m_1 & 1 & \dots & 0 \\ m_2 & \binom{1}{0}m_1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ m_k & \binom{k-1}{0}m_{k-1} & \dots & \binom{k-1}{k-2}m_1 \end{vmatrix}. \qquad (4.36)$$

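EXAMPLE 4.11 Suppose that $k = 1$, then formula (4.36) above yields

$$\kappa_1 = m_1.$$

For $k = 2$ we obtain

$$\kappa_2 = -\begin{vmatrix} m_1 & 1 \\ m_2 & \binom{1}{0}m_1 \end{vmatrix} = m_2 - m_1^2.$$

For $k = 3$ we have to calculate

$$\kappa_3 = \begin{vmatrix} m_1 & 1 & 0 \\ m_2 & m_1 & 1 \\ m_3 & m_2 & 2m_1 \end{vmatrix}.$$

Calculating the determinant we have:

$$\kappa_3 = m_1\begin{vmatrix} m_1 & 1 \\ m_2 & 2m_1 \end{vmatrix} - m_2\begin{vmatrix} 1 & 0 \\ m_2 & 2m_1 \end{vmatrix} + m_3\begin{vmatrix} 1 & 0 \\ m_1 & 1 \end{vmatrix}$$
$$= m_1(2m_1^2 - m_2) - m_2(2m_1) + m_3$$
$$= m_3 - 3m_1m_2 + 2m_1^3. \qquad (4.37)$$

Similarly one calculates

$$\kappa_4 = m_4 - 4m_3m_1 - 3m_2^2 + 12m_2m_1^2 - 6m_1^4. \qquad (4.38)$$

The same type of process is used to express the moments in terms of the cumulants:

$$m_1 = \kappa_1$$
$$m_2 = \kappa_2 + \kappa_1^2$$
$$m_3 = \kappa_3 + 3\kappa_2\kappa_1 + \kappa_1^3$$
$$m_4 = \kappa_4 + 4\kappa_3\kappa_1 + 3\kappa_2^2 + 6\kappa_2\kappa_1^2 + \kappa_1^4. \qquad (4.39)$$

A very simple relationship can be observed between the semi-invariants and the central
moments $\mu_k = E(X - \mu)^k$, where $\mu = m_1$ as defined before. In fact, $\kappa_2 = \mu_2$, $\kappa_3 = \mu_3$ and
$\kappa_4 = \mu_4 - 3\mu_2^2$.

Skewness $\gamma_3$ and kurtosis $\gamma_4$ are defined as:

$$\gamma_3 = E(X - \mu)^3/\sigma^3$$
$$\gamma_4 = E(X - \mu)^4/\sigma^4. \qquad (4.40)$$

The skewness and kurtosis determine the shape of one-dimensional distributions. The skewness
of a normal distribution is 0 and the kurtosis equals 3. Using $\kappa_2 = \sigma^2$, $\kappa_3 = \mu_3$ and
$\kappa_4 = \mu_4 - 3\sigma^4$, the relation of these parameters to the cumulants is given by:

$$\gamma_3 = \frac{\kappa_3}{\kappa_2^{3/2}} \qquad (4.41)$$
$$\gamma_4 = \frac{\kappa_4}{\kappa_2^2} + 3. \qquad (4.42)$$

These relations will be used later in Section 18.2 on Projection Pursuit to determine deviations
from normality.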
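The conversions (4.37) to (4.39) are easy to mechanize. The Python sketch below is illustrative only: the function names are hypothetical, and the test case (the standard exponential distribution, whose raw moments are $m_k = k!$) is an assumption added here. With kurtosis defined as $E(X-\mu)^4/\sigma^4$, so that the normal value is 3, the cumulant ratio $\kappa_4/\kappa_2^2$ has to be shifted by 3.

```python
def cumulants_from_moments(m1, m2, m3, m4):
    """First four cumulants from the raw moments, see (4.37) and (4.38)."""
    k1 = m1
    k2 = m2 - m1 ** 2
    k3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    k4 = m4 - 4 * m3 * m1 - 3 * m2 ** 2 + 12 * m2 * m1 ** 2 - 6 * m1 ** 4
    return k1, k2, k3, k4

def moments_from_cumulants(k1, k2, k3, k4):
    """Inverse relations (4.39)."""
    m1 = k1
    m2 = k2 + k1 ** 2
    m3 = k3 + 3 * k2 * k1 + k1 ** 3
    m4 = k4 + 4 * k3 * k1 + 3 * k2 ** 2 + 6 * k2 * k1 ** 2 + k1 ** 4
    return m1, m2, m3, m4

def skewness_kurtosis(k2, k3, k4):
    """Skewness and (non-excess) kurtosis expressed through cumulants."""
    return k3 / k2 ** 1.5, k4 / k2 ** 2 + 3

# Standard exponential distribution: raw moments m_k = k!
kappa = cumulants_from_moments(1, 2, 6, 24)
print(kappa)                             # cumulants kappa_1, ..., kappa_4
print(moments_from_cumulants(*kappa))    # recovers the raw moments
print(skewness_kurtosis(*kappa[1:]))
```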


Summary

↪ The expectation of a random vector $X$ is $\mu = \int xf(x)\,dx$, the covariance
matrix $\Sigma = \mathrm{Var}(X) = E(X - \mu)(X - \mu)^\top$. We denote $X \sim (\mu, \Sigma)$.

↪ Expectations are linear, i.e., $E(\alpha X + \beta Y) = \alpha EX + \beta EY$. If $X$ and $Y$
are independent, then $E(XY^\top) = EX\,EY^\top$.

↪ The covariance between two random vectors $X$ and $Y$ is $\Sigma_{XY} =
\mathrm{Cov}(X,Y) = E(X - EX)(Y - EY)^\top = E(XY^\top) - EX\,EY^\top$. If $X$ and
$Y$ are independent, then $\mathrm{Cov}(X,Y) = 0$.

↪ The characteristic function (cf) of a random vector $X$ is $\varphi_X(t) = E(e^{\mathrm{i}t^\top X})$.

↪ The distribution of a $p$-dimensional random variable $X$ is completely determined
by all one-dimensional distributions of $t^\top X$ where $t \in \mathbb{R}^p$ (Theorem
of Cramér-Wold).

↪ The conditional expectation $E(X_2 \mid X_1)$ is the MSE best approximation of
$X_2$ by a function of $X_1$.


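4.3 Transformations

Suppose that $X$ has pdf $f_X(x)$. What is the pdf of $Y = 3X$? Or if $X = (X_1, X_2, X_3)^\top$, what
is the pdf of

$$Y = \begin{pmatrix} 3X_1 \\ X_1 - 4X_2 \\ X_3 \end{pmatrix}?$$

This is a special case of asking for the pdf of $Y$ when

$$X = u(Y) \qquad (4.43)$$

for a one-to-one transformation $u: \mathbb{R}^p \to \mathbb{R}^p$. Define the Jacobian of $u$ as

$$\mathcal{J} = \left(\frac{\partial x_i}{\partial y_j}\right) = \left(\frac{\partial u_i(y)}{\partial y_j}\right)$$

and let $\mathrm{abs}(|\mathcal{J}|)$ be the absolute value of the determinant of this Jacobian. The pdf of $Y$ is
given by

$$f_Y(y) = \mathrm{abs}(|\mathcal{J}|) \cdot f_X\{u(y)\}. \qquad (4.44)$$

Using this we can answer the introductory questions, namely

$$(x_1, \dots, x_p)^\top = u(y_1, \dots, y_p) = \tfrac{1}{3}(y_1, \dots, y_p)^\top$$

with

$$\mathcal{J} = \begin{pmatrix} \frac{1}{3} & & 0 \\ & \ddots & \\ 0 & & \frac{1}{3} \end{pmatrix}$$

and hence $\mathrm{abs}(|\mathcal{J}|) = \left(\frac{1}{3}\right)^p$. So the pdf of $Y$ is $\frac{1}{3^p}f_X\left(\frac{y}{3}\right)$.
This introductory example is a special case of

$$Y = \mathcal{A}X + b, \quad\text{where } \mathcal{A} \text{ is nonsingular.}$$

The inverse transformation is

$$X = \mathcal{A}^{-1}(Y - b).$$

Therefore

$$\mathcal{J} = \mathcal{A}^{-1},$$

and hence

$$f_Y(y) = \mathrm{abs}(|\mathcal{A}|^{-1})\,f_X\{\mathcal{A}^{-1}(y - b)\}. \qquad (4.45)$$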


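EXAMPLE 4.12 Consider $X = (X_1, X_2)^\top \in \mathbb{R}^2$ with density $f_X(x) = f_X(x_1, x_2)$,

$$\mathcal{A} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \quad b = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$

Then

$$Y = \mathcal{A}X + b = \begin{pmatrix} X_1 + X_2 \\ X_1 - X_2 \end{pmatrix}$$

and

$$|\mathcal{A}| = -2, \quad \mathrm{abs}(|\mathcal{A}|^{-1}) = \tfrac{1}{2}, \quad \mathcal{A}^{-1} = -\tfrac{1}{2}\begin{pmatrix} -1 & -1 \\ -1 & 1 \end{pmatrix}.$$

Hence

$$f_Y(y) = \mathrm{abs}(|\mathcal{A}|^{-1}) \cdot f_X(\mathcal{A}^{-1}y) = \tfrac{1}{2}f_X\left\{\tfrac{1}{2}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\begin{pmatrix} y_1 \\ y_2 \end{pmatrix}\right\} = \tfrac{1}{2}f_X\left\{\tfrac{1}{2}(y_1 + y_2), \tfrac{1}{2}(y_1 - y_2)\right\}. \qquad (4.46)$$

EXAMPLE 4.13 Consider $X \in \mathbb{R}^1$ with density $f_X(x)$ and $Y = \exp(X)$. According to (4.43)
$x = u(y) = \log(y)$ and hence the Jacobian is

$$\mathcal{J} = \frac{dx}{dy} = \frac{1}{y}.$$

The pdf of $Y$ is therefore:

$$f_Y(y) = \frac{1}{y}f_X\{\log(y)\}.$$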
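Formula (4.44) applied to Example 4.13 gives $f_Y(y) = \frac{1}{y}f_X\{\log(y)\}$. The following Python sketch checks this for the special case $X \sim N(0,1)$, which is an assumption made here for illustration: the density obtained from the Jacobian is compared with a numerical derivative of the cdf $P(Y \le y) = P(X \le \log y)$.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # assumed distribution of X: standard normal

def pdf_y(y):
    """pdf of Y = exp(X) from (4.44): abs(|J|) = 1/y and u(y) = log(y)."""
    return STD.pdf(math.log(y)) / y

def cdf_y(y):
    """P(Y <= y) = P(X <= log y)."""
    return STD.cdf(math.log(y))

y, h = 1.7, 1e-5   # hypothetical evaluation point and step size
numeric = (cdf_y(y + h) - cdf_y(y - h)) / (2 * h)  # central difference of the cdf
print(pdf_y(y), numeric)
```

Both numbers should agree up to the discretization error of the central difference.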
Summary

↪ If $X$ has pdf $f_X(x)$, then a transformed random vector $Y$, i.e., $X = u(Y)$,
has pdf $f_Y(y) = \mathrm{abs}(|\mathcal{J}|) \cdot f_X\{u(y)\}$, where $\mathcal{J}$ denotes the Jacobian $\mathcal{J} = \left(\dfrac{\partial u_i(y)}{\partial y_j}\right)$.

↪ In the case of a linear relation $Y = \mathcal{A}X + b$ the pdf's of $X$ and $Y$ are
related via $f_Y(y) = \mathrm{abs}(|\mathcal{A}|^{-1})\,f_X\{\mathcal{A}^{-1}(y - b)\}$.


4.4 The Multinormal Distribution

The multinormal distribution with mean $\mu$ and covariance $\Sigma > 0$ has the density

$$f(x) = |2\pi\Sigma|^{-1/2}\exp\left\{-\frac{1}{2}(x - \mu)^\top\Sigma^{-1}(x - \mu)\right\}. \qquad (4.47)$$

We write $X \sim N_p(\mu, \Sigma)$.

How is this multinormal distribution with mean $\mu$ and covariance $\Sigma$ related to the multivariate
standard normal $N_p(0, \mathcal{I}_p)$? Through a linear transformation using the results of Section
4.3, as shown in the next theorem.

THEOREM 4.5 Let $X \sim N_p(\mu, \Sigma)$ and $Y = \Sigma^{-1/2}(X - \mu)$ (Mahalanobis transformation).
Then

$$Y \sim N_p(0, \mathcal{I}_p),$$

i.e., the elements $Y_j \in \mathbb{R}$ are independent, one-dimensional $N(0,1)$ variables.

Proof:
Note that $(X - \mu)^\top\Sigma^{-1}(X - \mu) = Y^\top Y$. Application of (4.45) gives $\mathcal{J} = \Sigma^{1/2}$, hence

$$f_Y(y) = (2\pi)^{-p/2}\exp\left(-\frac{1}{2}y^\top y\right) \qquad (4.48)$$

which is by (4.47) the pdf of a $N_p(0, \mathcal{I}_p)$.

Note that the above Mahalanobis transformation yields in fact a random variable $Y =
(Y_1, \dots, Y_p)^\top$ composed of independent one-dimensional $Y_j \sim N_1(0,1)$ since

$$f_Y(y) = \frac{1}{(2\pi)^{p/2}}\exp\left(-\frac{1}{2}y^\top y\right) = \prod_{j=1}^{p}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y_j^2\right) = \prod_{j=1}^{p}f_{Y_j}(y_j).$$

Here each $f_{Y_j}(y)$ is a standard normal density $\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{y^2}{2}\right)$. From this it is clear that
$E(Y) = 0$ and $\mathrm{Var}(Y) = \mathcal{I}_p$.

How can we create $N_p(\mu, \Sigma)$ variables on the basis of $N_p(0, \mathcal{I}_p)$ variables? We use the inverse
linear transformation

$$X = \Sigma^{1/2}Y + \mu. \qquad (4.49)$$

Using (4.11) and (4.23) we can also check that $E(X) = \mu$ and $\mathrm{Var}(X) = \Sigma$. The following
theorem is useful because it presents the distribution of a variable after it has been linearly
transformed. The proof is left as an exercise.

THEOREM 4.6 Let $X \sim N_p(\mu, \Sigma)$ and $\mathcal{A}(p \times p)$, $c \in \mathbb{R}^p$, where $\mathcal{A}$ is nonsingular.
Then $Y = \mathcal{A}X + c$ is again a $p$-variate Normal, i.e.,

$$Y \sim N_p(\mathcal{A}\mu + c, \mathcal{A}\Sigma\mathcal{A}^\top). \qquad (4.50)$$
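The inverse transformation $X = \Sigma^{1/2}Y + \mu$ can be sketched for $p = 2$ in plain Python. Everything specific below is an assumption for illustration: the values of $\mu$ and $\Sigma$, the function names, and the closed-form symmetric square root of a $2 \times 2$ positive definite matrix, $\Sigma^{1/2} = (\Sigma + s\mathcal{I})/t$ with $s = \sqrt{|\Sigma|}$ and $t = \sqrt{\mathrm{tr}(\Sigma) + 2s}$.

```python
import math
import random

def sqrt_spd_2x2(S):
    """Symmetric square root of a 2x2 positive definite matrix."""
    (a, b), (_, d) = S
    s = math.sqrt(a * d - b * b)        # sqrt of the determinant
    t = math.sqrt(a + d + 2.0 * s)      # sqrt of trace + 2 s
    return [[(a + s) / t, b / t], [b / t, (d + s) / t]]

def sample_normal(mu, S, n, seed=0):
    """Draw n observations X = S^{1/2} Y + mu with Y ~ N_2(0, I_2)."""
    rng = random.Random(seed)
    R = sqrt_spd_2x2(S)
    out = []
    for _ in range(n):
        y1, y2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        out.append((R[0][0] * y1 + R[0][1] * y2 + mu[0],
                    R[1][0] * y1 + R[1][1] * y2 + mu[1]))
    return out

def mean_and_cov(xs):
    """Sample mean vector and covariance entries (divisor n)."""
    n = len(xs)
    m1 = sum(x[0] for x in xs) / n
    m2 = sum(x[1] for x in xs) / n
    s11 = sum((x[0] - m1) ** 2 for x in xs) / n
    s22 = sum((x[1] - m2) ** 2 for x in xs) / n
    s12 = sum((x[0] - m1) * (x[1] - m2) for x in xs) / n
    return (m1, m2), (s11, s12, s22)

mu = (3.0, 2.0)                       # hypothetical mean vector
Sigma = [[1.0, -1.5], [-1.5, 4.0]]    # hypothetical covariance matrix
xs = sample_normal(mu, Sigma, 20_000)
print(mean_and_cov(xs))
```

The sample mean and covariance should be close to the chosen $\mu$ and $\Sigma$, in line with $E(X) = \mu$ and $\mathrm{Var}(X) = \Sigma$.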


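Geometry of the $N_p(\mu, \Sigma)$ Distribution

From (4.47) we see that the density of the $N_p(\mu, \Sigma)$ distribution is constant on ellipsoids of
the form

$$(x - \mu)^\top\Sigma^{-1}(x - \mu) = d^2. \qquad (4.51)$$

EXAMPLE 4.14 Figure 4.3 shows the contour ellipses of a two-dimensional normal distribution.
Note that these contour ellipses are the iso-distance curves (2.34) from the mean of
this normal distribution corresponding to the metric $\Sigma^{-1}$.

According to Theorem 2.7 in Section 2.6 the half-lengths of the axes in the contour ellipsoid
are $\sqrt{d^2/\nu_i}$ where $\nu_i = 1/\lambda_i$ are the eigenvalues of $\Sigma^{-1}$ and $\lambda_i$ are the eigenvalues of $\Sigma$. The
rectangle inscribing an ellipse has sides with length $2d\sigma_i$ and is thus naturally proportional
to the standard deviations of $X_i$ ($i = 1, 2$).

The distribution of the quadratic form in (4.51) is given in the next theorem.

Figure 4.3. Scatterplot of a normal sample (left panel, "normal sample") and contour ellipses
(right panel, "contour ellipses") for $\mu = [\ldots]$ and $\Sigma = [\ldots]$. MVAcontnorm.xpl

THEOREM 4.7 If $X \sim N_p(\mu, \Sigma)$, then the variable $U = (X - \mu)^\top\Sigma^{-1}(X - \mu)$ has a $\chi_p^2$
distribution.

THEOREM 4.8 The characteristic function (cf) of a multinormal $N_p(\mu, \Sigma)$ is given by

$$\varphi_X(t) = \exp\left(\mathrm{i}\,t^\top\mu - \frac{1}{2}t^\top\Sigma t\right). \qquad (4.52)$$

We can check Theorem 4.8 by transforming the cf back:

$$f(x) = \frac{1}{(2\pi)^p}\int\exp\left(-\mathrm{i}t^\top x + \mathrm{i}t^\top\mu - \frac{1}{2}t^\top\Sigma t\right)dt$$
$$= \frac{1}{|2\pi\Sigma^{-1}|^{1/2}|2\pi\Sigma|^{1/2}}\int\exp\left[-\frac{1}{2}\left\{t^\top\Sigma t + 2\mathrm{i}t^\top(x - \mu) - (x - \mu)^\top\Sigma^{-1}(x - \mu)\right\}\right] \cdot \exp\left[-\frac{1}{2}\left\{(x - \mu)^\top\Sigma^{-1}(x - \mu)\right\}\right]dt$$
$$= \frac{1}{|2\pi\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x - \mu)^\top\Sigma^{-1}(x - \mu)\right\}$$

since

$$\int\frac{1}{|2\pi\Sigma^{-1}|^{1/2}}\exp\left[-\frac{1}{2}\left\{t^\top\Sigma t + 2\mathrm{i}t^\top(x - \mu) - (x - \mu)^\top\Sigma^{-1}(x - \mu)\right\}\right]dt$$
$$= \int\frac{1}{|2\pi\Sigma^{-1}|^{1/2}}\exp\left[-\frac{1}{2}\left\{\left(t + \mathrm{i}\Sigma^{-1}(x - \mu)\right)^\top\Sigma\left(t + \mathrm{i}\Sigma^{-1}(x - \mu)\right)\right\}\right]dt = 1.$$

Note that if $Y \sim N_p(0, \mathcal{I}_p)$ (e.g., the Mahalanobis-transform), then

$$\varphi_Y(t) = \exp\left(-\frac{1}{2}t^\top\mathcal{I}_pt\right) = \exp\left(-\frac{1}{2}\sum_{i=1}^{p}t_i^2\right) = \varphi_{Y_1}(t_1) \cdot \ldots \cdot \varphi_{Y_p}(t_p)$$

which is consistent with (4.33).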


Singular Normal Distribution

Suppose that we have $\mathrm{rank}(\Sigma) = k < p$, where $p$ is the dimension of $X$. We define the
(singular) density of $X$ with the aid of the G-Inverse $\Sigma^-$ of $\Sigma$,

$$f(x) = \frac{(2\pi)^{-k/2}}{(\lambda_1 \cdots \lambda_k)^{1/2}}\exp\left\{-\frac{1}{2}(x - \mu)^\top\Sigma^-(x - \mu)\right\} \qquad (4.53)$$

where

(1) $x$ lies on the hyperplane $\mathcal{N}^\top(x - \mu) = 0$ with $\mathcal{N}(p \times (p - k))$: $\mathcal{N}^\top\Sigma = 0$ and $\mathcal{N}^\top\mathcal{N} = \mathcal{I}_{p-k}$.

(2) $\Sigma^-$ is the G-Inverse of $\Sigma$, and $\lambda_1, \dots, \lambda_k$ are the nonzero eigenvalues of $\Sigma$.

What is the connection to a multinormal with $k$ dimensions? If

$$Y \sim N_k(0, \Lambda_1) \quad\text{and}\quad \Lambda_1 = \mathrm{diag}(\lambda_1, \dots, \lambda_k), \qquad (4.54)$$

then there exists an orthogonal matrix $\mathcal{B}(p \times k)$ with $\mathcal{B}^\top\mathcal{B} = \mathcal{I}_k$ so that $X = \mathcal{B}Y + \mu$ where
$X$ has a singular pdf of the form (4.53).

Gaussian Copula

The second important copula that we want to present is the Gaussian or normal copula,

$$C_\rho(u,v) = \int_{-\infty}^{\Phi_1^{-1}(u)}\int_{-\infty}^{\Phi_2^{-1}(v)}f_\rho(x_1,x_2)\,dx_2\,dx_1, \qquad (4.55)$$

see Embrechts, McNeil and Straumann (1999). In (4.55), $f_\rho$ denotes the bivariate normal
density function with correlation $\rho$ for $n = 2$. The functions $\Phi_1$ and $\Phi_2$ in (4.55) refer to the
corresponding one-dimensional standard normal cdfs of the margins.

In the case of vanishing correlation, $\rho = 0$, the Gaussian copula becomes

$$C_0(u,v) = \int_{-\infty}^{\Phi_1^{-1}(u)}f_{X_1}(x_1)\,dx_1\int_{-\infty}^{\Phi_2^{-1}(v)}f_{X_2}(x_2)\,dx_2 = uv = \Pi(u,v).$$

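Summary

↪ The pdf of a $p$-dimensional multinormal $X \sim N_p(\mu, \Sigma)$ is

$$f(x) = |2\pi\Sigma|^{-1/2}\exp\left\{-\frac{1}{2}(x - \mu)^\top\Sigma^{-1}(x - \mu)\right\}.$$

The contour curves of a multinormal are ellipsoids with half-lengths proportional
to $\sqrt{\lambda_i}$, where $\lambda_i$ denotes the eigenvalues of $\Sigma$ ($i = 1, \dots, p$).

↪ The Mahalanobis transformation transforms $X \sim N_p(\mu, \Sigma)$ to $Y =
\Sigma^{-1/2}(X - \mu) \sim N_p(0, \mathcal{I}_p)$. Going the other direction, one can create
a $X \sim N_p(\mu, \Sigma)$ from $Y \sim N_p(0, \mathcal{I}_p)$ via $X = \Sigma^{1/2}Y + \mu$.

↪ If the covariance matrix $\Sigma$ is singular (i.e., $\mathrm{rank}(\Sigma) < p$), then it defines
a singular normal distribution.

↪ The density of a singular normal distribution is given by

$$f(x) = \frac{(2\pi)^{-k/2}}{(\lambda_1 \cdots \lambda_k)^{1/2}}\exp\left\{-\frac{1}{2}(x - \mu)^\top\Sigma^-(x - \mu)\right\}.$$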
4.5 Sampling Distributions and Limit Theorems

In multivariate statistics, we observe the values of a multivariate random variable $X$ and
obtain a sample $\{x_i\}_{i=1}^n$, as described in Chapter 3. Under random sampling, these observations
are considered to be realizations of a sequence of i.i.d. random variables $X_1, \dots, X_n$,
where each $X_i$ is a $p$-variate random variable which replicates the parent or population random
variable $X$. Some notational confusion is hard to avoid: $X_i$ is not the $i$th component
of $X$, but rather the $i$th replicate of the $p$-variate random variable $X$ which provides the $i$th
observation $x_i$ of our sample.

For a given random sample $X_1, \dots, X_n$, the idea of statistical inference is to analyze the
properties of the population variable $X$. This is typically done by analyzing some characteristic
$\theta$ of its distribution, like the mean, covariance matrix, etc. Statistical inference in a
multivariate setup is considered in more detail in Chapters 6 and 7.

Inference can often be performed using some observable function of the sample $X_1, \dots, X_n$,
i.e., a statistic. Examples of such statistics were given in Chapter 3: the sample mean $\bar{x}$,
the sample covariance matrix $\mathcal{S}$. To get an idea of the relationship between a statistic and
the corresponding population characteristic, one has to derive the sampling distribution of
the statistic. The next example gives some insight into the relation of $(\bar{x}, \mathcal{S})$ to $(\mu, \Sigma)$.

EXAMPLE 4.15 Consider an i.i.d. sample of $n$ random vectors $X_i \in \mathbb{R}^p$ where $E(X_i) = \mu$
and $\mathrm{Var}(X_i) = \Sigma$. The sample mean $\bar{x}$ and the covariance matrix $\mathcal{S}$ have already been defined
in Section 3.3. It is easy to prove the following results

$$E(\bar{x}) = \frac{1}{n}\sum_{i=1}^{n}E(X_i) = \mu$$
$$\mathrm{Var}(\bar{x}) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i) = \frac{1}{n}\Sigma = E(\bar{x}\bar{x}^\top) - \mu\mu^\top$$
$$E(\mathcal{S}) = \frac{1}{n}E\left\{\sum_{i=1}^{n}(X_i - \bar{x})(X_i - \bar{x})^\top\right\}$$
$$= \frac{1}{n}E\left\{\sum_{i=1}^{n}X_iX_i^\top - n\,\bar{x}\bar{x}^\top\right\}$$
$$= \frac{1}{n}\left\{n\left(\Sigma + \mu\mu^\top\right) - n\left(\frac{1}{n}\Sigma + \mu\mu^\top\right)\right\}$$
$$= \frac{n-1}{n}\Sigma.$$

This shows in particular that $\mathcal{S}$ is a biased estimator of $\Sigma$. By contrast, $\mathcal{S}_u = \frac{n}{n-1}\mathcal{S}$ is an
unbiased estimator of $\Sigma$.
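The bias factor $(n-1)/n$ can be verified exactly in a small one-dimensional case by enumerating all possible samples. The Python sketch below does this for i.i.d. Bernoulli observations; the choice of the Bernoulli distribution, the sample sizes and the function name are assumptions for illustration.

```python
from fractions import Fraction as Fr
from itertools import product

def expected_S(n, p=Fr(1, 2)):
    """Exact E(S) for S = (1/n) * sum (x_i - xbar)^2 with X_i i.i.d. Bernoulli(p)."""
    total = Fr(0)
    for sample in product((0, 1), repeat=n):
        prob = Fr(1)
        for x in sample:
            prob *= p if x == 1 else 1 - p
        xbar = Fr(sum(sample), n)
        S = sum((x - xbar) ** 2 for x in sample) / n
        total += prob * S
    return total

n = 4
sigma2 = Fr(1, 2) * Fr(1, 2)          # population variance p(1 - p)
print(expected_S(n), Fr(n - 1, n) * sigma2)   # both sides of E(S) = (n-1)/n * sigma^2
```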


Statistical inference often requires more than just the mean and/or the variance of a statistic. 
We need the sampling distribution of the statistics to derive confidence intervals or to define 
rejection regions in hypothesis testing for a given significance level. Theorem 4.9 gives the 
distribution of the sample mean for a multinormal population. 
THEOREM 4.9 Let $X_1, \dots, X_n$ be i.i.d. with $X_i \sim N_p(\mu, \Sigma)$. Then $\bar{x} \sim N_p(\mu, \frac{1}{n}\Sigma)$.

Proof:
$\bar{x} = (1/n)\sum_{i=1}^{n}X_i$ is a linear combination of independent normal variables, so it has a normal
distribution (see Chapter 5). The mean and the covariance matrix were given in the
preceding example.

With multivariate statistics, the sampling distributions of the statistics are often more difficult
to derive than in the preceding Theorem. In addition they might be so complicated
that approximations have to be used. These approximations are provided by limit theorems.
Since they are based on asymptotic limits, the approximations are only valid when the sample
size is large enough. In spite of this restriction, they make complicated situations rather
simple. The following central limit theorem shows that even if the parent distribution is
not normal, when the sample size $n$ is large, the sample mean $\bar{x}$ has an approximate normal
distribution.

THEOREM 4.10 (Central Limit Theorem (CLT)) Let $X_1, X_2, \dots, X_n$ be i.i.d. with
$X_i \sim (\mu, \Sigma)$. Then the distribution of $\sqrt{n}(\bar{x} - \mu)$ is asymptotically $N_p(0, \Sigma)$, i.e.,

$$\sqrt{n}(\bar{x} - \mu) \stackrel{\mathcal{L}}{\longrightarrow} N_p(0, \Sigma) \quad\text{as } n \to \infty.$$

The symbol "$\stackrel{\mathcal{L}}{\longrightarrow}$" denotes convergence in distribution which means that the distribution
function of the random vector $\sqrt{n}(\bar{x} - \mu)$ converges to the distribution function of $N_p(0, \Sigma)$.


EXAMPLE 4.16 Assume that $X_1, \dots, X_n$ are i.i.d. and that they have Bernoulli distributions
where $p = \frac{1}{2}$ (this means that $P(X_i = 1) = \frac{1}{2}$, $P(X_i = 0) = \frac{1}{2}$). Then $\mu = p = \frac{1}{2}$ and
$\Sigma = p(1 - p) = \frac{1}{4}$. Hence,

$$\sqrt{n}\left(\bar{x} - \frac{1}{2}\right) \stackrel{\mathcal{L}}{\longrightarrow} N_1\left(0, \frac{1}{4}\right) \quad\text{as } n \to \infty.$$

The results are shown in Figure 4.4 for varying sample sizes.

EXAMPLE 4.17 Now consider a two-dimensional random sample $X_1, \dots, X_n$ that is i.i.d.
and created from two independent Bernoulli distributions with $p = 0.5$. The joint distribution
is given by $P(X_i = (0,0)^\top) = \frac{1}{4}$, $P(X_i = (0,1)^\top) = \frac{1}{4}$, $P(X_i = (1,0)^\top) = \frac{1}{4}$, $P(X_i =
(1,1)^\top) = \frac{1}{4}$. Here we have

$$\sqrt{n}\left\{\bar{x} - \begin{pmatrix} \frac{1}{2} \\ \frac{1}{2} \end{pmatrix}\right\} \stackrel{\mathcal{L}}{\longrightarrow} N_2\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \frac{1}{4} & 0 \\ 0 & \frac{1}{4} \end{pmatrix}\right) \quad\text{as } n \to \infty.$$

Figure 4.5 displays the estimated two-dimensional density for different sample sizes.

Figure 4.4. The CLT for Bernoulli distributed random variables. Sample
size $n = 5$ (left, panel title "Asymptotic Distribution, N=5") and $n = 35$ (right, panel title
"Asymptotic Distribution, N=35"). MVAcltbern.xpl

Figure 4.5. The CLT in the two-dimensional case. Sample size $n = 5$
(left) and $n = 85$ (right). MVAcltbern2.xpl
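For the Bernoulli case of Example 4.16 the exact distribution of $\sqrt{n}(\bar{x} - \frac{1}{2})$ is a rescaled binomial, so the quality of the normal approximation can be measured directly. The following Python sketch is an illustration; the function name and the use of the largest cdf gap at the jump points as a distance are assumptions made here.

```python
import math
from statistics import NormalDist

def max_cdf_gap(n, p=0.5):
    """Largest gap between the exact cdf of sqrt(n)(xbar - p) and the cdf of
    N(0, p(1-p)), checked on both sides of every jump point."""
    limit = NormalDist(0.0, math.sqrt(p * (1 - p)))
    gap, cum = 0.0, 0.0
    for k in range(n + 1):
        z = math.sqrt(n) * (k / n - p)
        below = cum                                   # exact cdf just left of the jump
        cum += math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        target = limit.cdf(z)
        gap = max(gap, abs(cum - target), abs(below - target))
    return gap

for n in (5, 35, 500):
    print(n, max_cdf_gap(n))   # the gap shrinks as n grows
```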


The asymptotic normal distribution is often used to construct confidence intervals for the
unknown parameters. A confidence interval at the level $1 - \alpha$, $\alpha \in (0,1)$, is an interval that
covers the true parameter with probability $1 - \alpha$:

$$P(\theta \in [\widehat{\theta}_l, \widehat{\theta}_u]) = 1 - \alpha,$$

where $\theta$ denotes the (unknown) parameter and $\widehat{\theta}_l$ and $\widehat{\theta}_u$ are the lower and upper confidence
bounds respectively.

EXAMPLE 4.18 Consider the i.i.d. random variables $X_1, \dots, X_n$ with $X_i \sim (\mu, \sigma^2)$ and $\sigma^2$
known. Since we have $\sqrt{n}(\bar{x} - \mu) \stackrel{\mathcal{L}}{\longrightarrow} N(0, \sigma^2)$ from the CLT, it follows that

$$P\left(-u_{1-\alpha/2} \le \sqrt{n}\,\frac{(\bar{x} - \mu)}{\sigma} \le u_{1-\alpha/2}\right) \longrightarrow 1 - \alpha, \quad\text{as } n \to \infty,$$

where $u_{1-\alpha/2}$ denotes the $(1 - \alpha/2)$-quantile of the standard normal distribution. Hence the
interval

$$\left[\bar{x} - \frac{\sigma}{\sqrt{n}}u_{1-\alpha/2},\ \bar{x} + \frac{\sigma}{\sqrt{n}}u_{1-\alpha/2}\right]$$

is an approximate $(1 - \alpha)$-confidence interval for $\mu$.

But what can we do if we do not know the variance $\sigma^2$? The following corollary gives the
answer.

COROLLARY 4.1 If $\widehat{\Sigma}$ is a consistent estimate for $\Sigma$, then the CLT still holds, namely

$$\sqrt{n}\,\widehat{\Sigma}^{-1/2}(\bar{x} - \mu) \stackrel{\mathcal{L}}{\longrightarrow} N_p(0, \mathcal{I}) \quad\text{as } n \to \infty.$$

EXAMPLE 4.19 Consider the i.i.d. random variables $X_1, \dots, X_n$ with $X_i \sim (\mu, \sigma^2)$, and
now with an unknown variance $\sigma^2$. From Corollary 4.1 using $\widehat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ we obtain

$$\sqrt{n}\left(\frac{\bar{x} - \mu}{\widehat{\sigma}}\right) \stackrel{\mathcal{L}}{\longrightarrow} N(0,1) \quad\text{as } n \to \infty.$$

Hence we can construct an approximate $(1 - \alpha)$-confidence interval for $\mu$ using the variance
estimate $\widehat{\sigma}^2$:

$$C_{1-\alpha} = \left[\bar{x} - \frac{\widehat{\sigma}}{\sqrt{n}}u_{1-\alpha/2},\ \bar{x} + \frac{\widehat{\sigma}}{\sqrt{n}}u_{1-\alpha/2}\right].$$

Note that by the CLT

$$P(\mu \in C_{1-\alpha}) \longrightarrow 1 - \alpha \quad\text{as } n \to \infty.$$

REMARK 4.1 One may wonder how large $n$ should be in practice to provide reasonable
approximations. There is no definite answer to this question: it mainly depends on the
problem at hand (the shape of the distribution of the $X_i$ and the dimension of $X_i$). If the
$X_i$ are normally distributed, the normality of $\bar{x}$ is achieved from $n = 1$. In most situations,
however, the approximation is valid in one-dimensional problems for $n$ larger than, say, 50.
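The asymptotic interval $\bar{x} \pm \frac{\widehat{\sigma}}{\sqrt{n}}u_{1-\alpha/2}$ with estimated variance can be computed in a few lines. The Python sketch below is illustrative; the function name `asymptotic_ci` and the data are hypothetical, and the variance estimate uses the divisor $n$ as in $\widehat{\sigma}^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$.

```python
import math
from statistics import NormalDist

def asymptotic_ci(xs, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for the mean, based on the CLT."""
    n = len(xs)
    xbar = sum(xs) / n
    sigma_hat = math.sqrt(sum((x - xbar) ** 2 for x in xs) / n)   # divisor n
    u = NormalDist().inv_cdf(1 - alpha / 2)   # (1 - alpha/2)-quantile of N(0, 1)
    half = sigma_hat / math.sqrt(n) * u
    return xbar - half, xbar + half

data = [1.0, 2.0, 3.0, 4.0, 5.0] * 20     # hypothetical sample, n = 100
lower, upper = asymptotic_ci(data)
print(lower, upper)
```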
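Transformation of Statistics

Often in practical problems, one is interested in a function of parameters for which one has
an asymptotically normal statistic. Suppose for instance that we are interested in a cost
function depending on the mean $\mu$ of the process: $f(\mu) = \mu^\top\mathcal{A}\mu$ where $\mathcal{A} > 0$ is given. To
estimate $\mu$ we use the asymptotically normal statistic $\bar{x}$. The question is: how does $f(\bar{x})$
behave? More generally, what happens to a statistic $t$ that is asymptotically normal when
we transform it by a function $f(t)$? The answer is given by the following theorem.

THEOREM 4.11 If $\sqrt{n}(t - \mu) \stackrel{\mathcal{L}}{\longrightarrow} N_p(0, \Sigma)$ and if $f = (f_1, \dots, f_q)^\top: \mathbb{R}^p \to \mathbb{R}^q$ are real
valued functions which are differentiable at $\mu \in \mathbb{R}^p$, then $f(t)$ is asymptotically normal with
mean $f(\mu)$ and covariance $\mathcal{D}^\top\Sigma\mathcal{D}$, i.e.,

$$\sqrt{n}\{f(t) - f(\mu)\} \stackrel{\mathcal{L}}{\longrightarrow} N_q(0, \mathcal{D}^\top\Sigma\mathcal{D}) \quad\text{for } n \to \infty, \qquad (4.56)$$

where

$$\mathcal{D} = \left(\frac{\partial f_j}{\partial t_i}\right)(t)\bigg|_{t=\mu}$$

is the $(p \times q)$ matrix of all partial derivatives.

EXAMPLE 4.20 We are interested in seeing how $f(\bar{x}) = \bar{x}^\top\mathcal{A}\bar{x}$ behaves asymptotically with
respect to the quadratic cost function of $\mu$, $f(\mu) = \mu^\top\mathcal{A}\mu$, where $\mathcal{A} > 0$.

$$\mathcal{D} = \frac{\partial f(\bar{x})}{\partial\bar{x}}\bigg|_{\bar{x}=\mu} = 2\mathcal{A}\mu.$$

By Theorem 4.11 we have

$$\sqrt{n}(\bar{x}^\top\mathcal{A}\bar{x} - \mu^\top\mathcal{A}\mu) \stackrel{\mathcal{L}}{\longrightarrow} N_1(0, 4\mu^\top\mathcal{A}\Sigma\mathcal{A}\mu).$$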


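EXAMPLE 4.21 Suppose

$$X_i \sim (\mu, \Sigma); \quad \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}, \quad p = 2.$$

We have by the CLT (Theorem 4.10) for $n \to \infty$ that

$$\sqrt{n}(\bar{x} - \mu) \stackrel{\mathcal{L}}{\longrightarrow} N(0, \Sigma).$$

Suppose that we would like to compute the distribution of $\begin{pmatrix} \bar{x}_1^2 - \bar{x}_2 \\ \bar{x}_1 + 3\bar{x}_2 \end{pmatrix}$. According to Theorem
4.11 we have to consider $f = (f_1, f_2)^\top$ with

$$f_1(x_1,x_2) = x_1^2 - x_2, \quad f_2(x_1,x_2) = x_1 + 3x_2, \quad q = 2.$$

Given this $f(\mu) = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$ and

$$\mathcal{D} = (d_{ij}), \quad d_{ij} = \left(\frac{\partial f_j}{\partial x_i}\right)\bigg|_{x=\mu} = \begin{pmatrix} 2x_1 & 1 \\ -1 & 3 \end{pmatrix}\bigg|_{x=0}.$$

Thus

$$\mathcal{D} = \begin{pmatrix} 0 & 1 \\ -1 & 3 \end{pmatrix}.$$

The covariance is

$$\mathcal{D}^\top\Sigma\mathcal{D} = \begin{pmatrix} 0 & -1 \\ 1 & 3 \end{pmatrix}\begin{pmatrix} 1 & \frac{1}{2} \\ \frac{1}{2} & 1 \end{pmatrix}\begin{pmatrix} 0 & 1 \\ -1 & 3 \end{pmatrix} = \begin{pmatrix} 0 & -1 \\ 1 & 3 \end{pmatrix}\begin{pmatrix} -\frac{1}{2} & \frac{5}{2} \\ -1 & \frac{7}{2} \end{pmatrix} = \begin{pmatrix} 1 & -\frac{7}{2} \\ -\frac{7}{2} & 13 \end{pmatrix},$$

which yields

$$\sqrt{n}\begin{pmatrix} \bar{x}_1^2 - \bar{x}_2 \\ \bar{x}_1 + 3\bar{x}_2 \end{pmatrix} \stackrel{\mathcal{L}}{\longrightarrow} N_2\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & -\frac{7}{2} \\ -\frac{7}{2} & 13 \end{pmatrix}\right).$$

EXAMPLE 4.22 Let us continue the previous example by adding one more component to the
function $f$. Since $q = 3 > p = 2$, we might expect a singular normal distribution. Consider
$f = (f_1, f_2, f_3)^\top$ with

$$f_1(x_1,x_2) = x_1^2 - x_2, \quad f_2(x_1,x_2) = x_1 + 3x_2, \quad f_3 = x_2^3, \quad q = 3.$$

From this we have that

$$\mathcal{D} = \begin{pmatrix} 0 & 1 & 0 \\ -1 & 3 & 0 \end{pmatrix} \quad\text{and thus}\quad \mathcal{D}^\top\Sigma\mathcal{D} = \begin{pmatrix} 1 & -\frac{7}{2} & 0 \\ -\frac{7}{2} & 13 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$

The limit is in fact a singular normal distribution!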
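The asymptotic covariance $\mathcal{D}^\top\Sigma\mathcal{D}$ of Theorem 4.11 can be computed numerically once the Jacobian is approximated by finite differences. The Python sketch below is an illustration under assumptions: the helper names are hypothetical, $\mu = (0,0)^\top$ and $\Sigma$ with unit variances and covariance $0.5$ are taken as the setting of the two preceding examples, and the third component is taken as $x_2^3$, i.e. a function whose gradient vanishes at $\mu$.

```python
def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def jacobian(f, mu, h=1e-6):
    """(p x q) matrix D with D[i][j] = d f_j / d x_i at mu (central differences)."""
    D = []
    for i in range(len(mu)):
        up, dn = list(mu), list(mu)
        up[i] += h
        dn[i] -= h
        D.append([(a - b) / (2 * h) for a, b in zip(f(up), f(dn))])
    return D

def delta_covariance(f, mu, Sigma):
    """Asymptotic covariance D' Sigma D of sqrt(n) {f(t) - f(mu)}."""
    D = jacobian(f, mu)
    return matmul(matmul(transpose(D), Sigma), D)

mu = [0.0, 0.0]
Sigma = [[1.0, 0.5], [0.5, 1.0]]
f2 = lambda x: [x[0] ** 2 - x[1], x[0] + 3 * x[1]]
f3 = lambda x: [x[0] ** 2 - x[1], x[0] + 3 * x[1], x[1] ** 3]

print(delta_covariance(f2, mu, Sigma))   # approx. [[1, -3.5], [-3.5, 13]]
print(delta_covariance(f3, mu, Sigma))   # third row and column are (numerically) zero
```

The zero row and column in the second result reflect the rank deficiency that makes the limiting distribution singular.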


Summary

↪ If $X_1, \dots, X_n$ are i.i.d. random vectors with $X_i \sim N_p(\mu, \Sigma)$, then $\bar{x} \sim
N_p(\mu, \frac{1}{n}\Sigma)$.

↪ If $X_1, \dots, X_n$ are i.i.d. random vectors with $X_i \sim (\mu, \Sigma)$, then the distribution
of $\sqrt{n}(\bar{x} - \mu)$ is asymptotically $N(0, \Sigma)$ (Central Limit Theorem).

↪ If $X_1, \dots, X_n$ are i.i.d. random variables with $X_i \sim (\mu, \sigma^2)$, then an asymptotic
confidence interval can be constructed by the CLT: $\bar{x} \pm \frac{\widehat{\sigma}}{\sqrt{n}}u_{1-\alpha/2}$.

↪ If $t$ is a statistic that is asymptotically normal, i.e., $\sqrt{n}(t - \mu) \stackrel{\mathcal{L}}{\longrightarrow} N_p(0, \Sigma)$,
then this holds also for a function $f(t)$, i.e., $\sqrt{n}\{f(t) - f(\mu)\}$ is asymptotically
normal.
4.6 Bootstrap


Recall that we need large sample sizes in order to sufficiently approximate the critical values computable by the CLT. Here large means n = 50 for one-dimensional data. How can we construct confidence intervals in the case of smaller sample sizes? One way is to use a method called the Bootstrap. The Bootstrap algorithm uses the data twice:


1. estimate the parameter of interest, 


2. simulate from an estimated distribution to approximate the asymptotic distribution of 
the statistics of interest. 


In detail, bootstrap works as follows. Consider the observations $x_1, \ldots, x_n$ of the sample $X_1, \ldots, X_n$ and estimate the empirical distribution function (edf) $F_n$. In the case of one-dimensional data

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x). \tag{4.57}$$


This is a step function which is constant between neighboring data points. 
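As an illustration of (4.57), here is a minimal sketch of the edf in plain Python. The function name `edf` and the toy sample are illustrative choices, not part of the text.

```python
import bisect


def edf(sample):
    """Return F_n, the empirical distribution function of `sample`."""
    ordered = sorted(sample)
    n = len(ordered)

    def f_n(x):
        # bisect_right counts the observations that are <= x
        return bisect.bisect_right(ordered, x) / n

    return f_n


f_n = edf([2.0, -1.0, 0.5, 0.5])
print([f_n(x) for x in (-2.0, -1.0, 0.5, 1.0, 2.0)])
```

The returned function jumps by 1/n at each observation (by 2/n at the tied value 0.5) and is constant in between, which is the step-function behavior described above.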


EXAMPLE 4.23 Suppose that we have n = 100 standard normal N(0,1) data points $X_i$, $i = 1, \ldots, n$. The cdf of X is $\Phi(x) = \int_{-\infty}^{x} \varphi(u)\,du$ and is shown in Figure 4.6 as the thin, solid line. The empirical distribution function (edf) is displayed as a thick step function line. Figure 4.7 shows the same setup for n = 1000 observations.


Now draw with replacement a new sample from this empirical distribution. That is, we sample with replacement $n^*$ observations $X_1^*, \ldots, X_{n^*}^*$ from the original sample. This is called a Bootstrap sample. Usually one takes $n^* = n$.


Since we sample with replacement, a single observation from the original sample may appear several times in the Bootstrap sample. For instance, if the original sample consists of the three observations $x_1, x_2, x_3$, then a Bootstrap sample might look like $X_1^* = x_3$, $X_2^* = x_2$, $X_3^* = x_3$. Computationally, we find the Bootstrap sample by using a uniform random number generator to draw from the indices 1, 2, ..., n of the original samples.
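The index-drawing step can be sketched as follows. This is a minimal illustration in plain Python; the function name `bootstrap_sample`, the seed and the toy data are assumptions made for the example.

```python
import random


def bootstrap_sample(sample, rng, size=None):
    """Draw a Bootstrap sample (with replacement) from `sample`."""
    n = len(sample)
    size = n if size is None else size  # the usual choice is n* = n
    # Each draw selects one of the indices 0, ..., n-1 with probability 1/n.
    indices = [rng.randrange(n) for _ in range(size)]
    return [sample[i] for i in indices]


rng = random.Random(0)
original = [1.2, -0.7, 3.4]
resample = bootstrap_sample(original, rng)
print(resample)
```

Because the indices are drawn independently, the same original observation can show up more than once in `resample`, and others may not appear at all.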


The Bootstrap observations are drawn randomly from the empirical distribution, i.e., the 
probability for each original observation to be selected into the Bootstrap sample is 1/n for 
each draw. It is easy to compute that 


$$E_{F_n}(X_i^*) = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar x.$$
Figure 4.6. The standard normal cdf (thin line) and the empirical distribution function (thick line) for n = 100. MVAedfnormal.xpl


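That is, when the underlying cdf is the edf $F_n$, the expected value of a Bootstrap observation is the mean of the original sample $x_1, \ldots, x_n$. The same holds for the variance, i.e.,

$$\operatorname{Var}_{F_n}(X_i^*) = \hat\sigma^2,$$

where $\hat\sigma^2 = \frac{1}{n} \sum (x_i - \bar x)^2$. The cdf of the bootstrap observations is defined as in (4.57).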


Figure 4.8 shows the cdf of the n = 100 original observations as a solid line and two bootstrap 
cdf’s as thin lines. 


The CLT holds for the bootstrap sample. Analogously to Corollary 4.1 we have the following 
corollary. 
Figure 4.7. The standard normal cdf (thin line) and the empirical distribution function (thick line) for n = 1000. MVAedfnormal.xpl


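COROLLARY 4.2 If $X_1^*, \ldots, X_n^*$ is a bootstrap sample from $X_1, \ldots, X_n$, then the distribution of

$$\sqrt{n} \left( \frac{\bar x^* - \bar x}{\hat\sigma^*} \right)$$

also becomes N(0,1) asymptotically, where $\bar x^* = \frac{1}{n} \sum_{i=1}^{n} X_i^*$ and $(\hat\sigma^*)^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i^* - \bar x^*)^2$.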


How do we find a confidence interval for $\mu$ using the Bootstrap method? Recall that the quantile $u_{1-\alpha/2}$ might be bad for small sample sizes because the true distribution of $\sqrt{n}\left(\frac{\bar x - \mu}{\hat\sigma}\right)$ might be far away from the limit distribution N(0,1). The Bootstrap idea enables us to "simulate" this distribution by computing $\sqrt{n}\left(\frac{\bar x^* - \bar x}{\hat\sigma^*}\right)$ for many Bootstrap samples. In this way we can estimate an empirical $(1 - \alpha/2)$-quantile $u^*_{1-\alpha/2}$. The bootstrap improved confidence interval is then

$$C^*_{1-\alpha} = \left[ \bar x - \frac{\hat\sigma}{\sqrt{n}}\, u^*_{1-\alpha/2},\; \bar x + \frac{\hat\sigma}{\sqrt{n}}\, u^*_{1-\alpha/2} \right].$$


Figure 4.8. The cdf $F_n$ (thick line) and two bootstrap cdf's $F_n^*$ (thin lines). MVAedfbootstrap.xpl


By Corollary 4.2 we have

$$P(\mu \in C^*_{1-\alpha}) \longrightarrow 1 - \alpha \quad \text{as } n \to \infty,$$

but with an improved speed of convergence, see Hall (1992).
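The construction of the bootstrap improved confidence interval can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the number of Bootstrap replications `n_boot`, the seeds and the simple order-statistic rule for the empirical quantile are choices made here, not prescriptions from the text.

```python
import math
import random


def mean_and_sd(values):
    """Sample mean and standard deviation with divisor n, as in the text."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(var)


def bootstrap_ci(sample, alpha=0.05, n_boot=2000, seed=0):
    """Interval mean -/+ sd / sqrt(n) * u*, with u* a bootstrap quantile."""
    rng = random.Random(seed)
    n = len(sample)
    mean, sd = mean_and_sd(sample)
    stats = []
    for _ in range(n_boot):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        b_mean, b_sd = mean_and_sd(resample)
        if b_sd > 0:  # skip degenerate resamples in which all values coincide
            stats.append(math.sqrt(n) * (b_mean - mean) / b_sd)
    stats.sort()
    # empirical (1 - alpha/2)-quantile of the simulated statistics
    k = min(len(stats) - 1, math.ceil((1 - alpha / 2) * len(stats)) - 1)
    u_star = stats[k]
    half_width = sd / math.sqrt(n) * u_star
    return mean - half_width, mean + half_width


data_rng = random.Random(1)
data = [data_rng.gauss(0.0, 1.0) for _ in range(15)]
low, high = bootstrap_ci(data)
print(low, high)
```

For a small sample such as n = 15 the simulated quantile `u_star` will typically differ from the normal quantile 1.96, which is exactly the correction the Bootstrap is meant to supply.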


Summary

- For small sample sizes the bootstrap improves the precision of the confidence interval.

- The bootstrap distribution $\mathcal{L}(\sqrt{n}(\bar x^* - \bar x)/\hat\sigma^*)$ converges to the same asymptotic limit as the distribution $\mathcal{L}(\sqrt{n}(\bar x - \mu)/\hat\sigma)$.
4.7 Exercises


EXERCISE 4.1 Assume that the random vector Y has the following normal distribution: $Y \sim N_p(0, \mathcal{I})$. Transform it according to (4.49) to create $X \sim N(\mu, \Sigma)$ with mean $\mu = (3, 2)^\top$


and i = (ae ail How would you implement the resulting formula on a computer? 


EXERCISE 4.2 Prove Theorem 4.7 using Theorem 4.5. 


EXERCISE 4.3 Suppose that X has mean zero and covariance i = iG a Let Y = X%,4+ Xo. 
Write Y as a linear transformation, i.e., find the transformation matrix A. Then compute 


Var(Y) via (4.26). Can you obtain the result in another fashion? 
EXERCISE 4.4 Calculate the mean and the variance of the estimate $\hat\beta$ in (3.50).


EXERCISE 4.5 Compute the conditional moments E(X2 | x1) and E(X, | x2) for the pdf 
of Example 4.6. 


EXERCISE 4.6 Prove the relation (4.28). 


EXERCISE 4.7 Prove the relation (4.29). Hint: Note that $\operatorname{Var}(E(X_2|X_1)) = E(E(X_2|X_1)\,E(X_2^\top|X_1)) - E(X_2)\,E(X_2^\top)$ and that $E(\operatorname{Var}(X_2|X_1)) = E[E(X_2 X_2^\top|X_1) - E(X_2|X_1)\,E(X_2^\top|X_1)]$.


EXERCISE 4.8 Compute (4.46) for the pdf of Example 4.5. 


EXERCISE 4.9 


sy — Gye O<m <2, lyw|<1-|l—y| 


is a pdf! 
0 otherwise pdf 


Show that fy(y) = 


EXERCISE 4.10 Compute (4.46) for a two-dimensional standard normal distribution. Show 
that the transformed random variables Y; and Y2 are independent. Give a geometrical inter- 
pretation of this result based on iso-distance curves. 


EXERCISE 4.11 Consider the Cauchy distribution which has no moment, so that the CLT cannot be applied. Simulate the distribution of $\bar x$ (for different n's). What can you expect for $n \to \infty$?

Hint: The Cauchy distribution can be simulated by the quotient of two independent standard 
normally distributed random variables. 
EXERCISE 4.12 A European car company has tested a new model and reports the consumption of gasoline ($X_1$) and oil ($X_2$). The expected consumption of gasoline is 8 liters per 100 km ($\mu_1$) and the expected consumption of oil is 1 liter per 10,000 km ($\mu_2$). The measured consumption of gasoline is 8.1 liters per 100 km ($\bar x_1$) and the measured consumption of oil is 1.1 liters per 10,000 km ($\bar x_2$). The asymptotic distribution of

$$\sqrt{n} \left\{ \begin{pmatrix} \bar x_1 \\ \bar x_2 \end{pmatrix} - \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \right\} \quad \text{is} \quad N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 0.1 & 0.05 \\ 0.05 & 0.1 \end{pmatrix} \right).$$

For the American market the basic measuring units are miles (1 mile ≈ 1.6 km) and gallons (1 gallon ≈ 3.8 liter). The consumptions of gasoline ($Y_1$) and oil ($Y_2$) are usually reported in miles per gallon. Can you express $\bar y_1$ and $\bar y_2$ in terms of $\bar x_1$ and $\bar x_2$? Recompute the asymptotic distribution for the American market!


EXERCISE 4.13 Consider the pdf $f(x_1, x_2) = e^{-(x_1 + x_2)}$, $x_1, x_2 > 0$ and let $U_1 = X_1 + X_2$ and $U_2 = X_1 - X_2$. Compute $f(u_1, u_2)$.


EXERCISE 4.14 Consider the pdf's


Tei) = Agriare7*1 £1, 22 > 0, 
f (eit) = 1 0<21,%2 <1 andxa,+22<1 
f(@uas) = se Ly > |ro\. 


For each of these pdf's compute $E(X)$, $\operatorname{Var}(X)$, $E(X_1|X_2)$, $E(X_2|X_1)$, $V(X_1|X_2)$ and $V(X_2|X_1)$.


1 
EXERCISE 4.15 Consider the pdf f(x1,22) = Sr) 2 OS ay Sty <1. Compute PUG = 
0.25), P(X» < 0.25) and P(X» < 0.25|X, < 0.25). 


EXERCISE 4.16 Consider the pdf $f(x_1, x_2) = \frac{1}{2\pi}$, $0 < x_1 < 2\pi$, $0 < x_2 < 1$. Let $U_1 = \sin X_1 \sqrt{-2 \log X_2}$ and $U_2 = \cos X_1 \sqrt{-2 \log X_2}$. Compute $f(u_1, u_2)$.


EXERCISE 4.17 Consider $f(x_1, x_2, x_3) = k(x_1 + x_2 x_3)$; $0 < x_1, x_2, x_3 < 1$.

a) Determine k so that f is a valid pdf of $(X_1, X_2, X_3) = X$.

b) Compute the (3 × 3) matrix $\Sigma_X$.

c) Compute the (2 × 2) matrix of the conditional variance of $(X_2, X_3)$ given $X_1 = x_1$.


EXERCISE 4.18 Let X ~ N; @ ( 5 Di 


a) Represent the contour ellipses for a = 0; —5; +5; iL, 
b) For a = 5 find the regions of X centered on $\mu$ which cover the area of the true parameter with probability 0.90 and 0.95.


EXERCISE 4.19 Consider the pdf 
1 _( 21422 
f (x1, £2) ae (s5+ #) L1,X2 > 0. 


Compute $f(x_2)$ and $f(x_1|x_2)$. Also give the best approximation of $X_1$ by a function of $X_2$. Compute the variance of the error of the approximation.


EXERCISE 4.20 Prove Theorem 4.6. 


5 Theory of the Multinormal 


In the preceding chapter we saw how the multivariate normal distribution comes into play in many applications. It is useful to know more about this distribution, since it is a good approximate distribution in many situations. Another reason for considering the multinormal distribution is that it has many appealing properties: it is stable under linear transforms, zero correlation corresponds to independence, the marginals and all the conditionals are also multivariate normal variates, etc. The mathematical properties of the multinormal make analyses much simpler.


In this chapter we will first concentrate on the probabilistic properties of the multinormal, 
then we will introduce two “companion” distributions of the multinormal which naturally 
appear when sampling from a multivariate normal population: the Wishart and the Hotelling 
distributions. The latter is particularly important for most of the testing procedures pro- 
posed in Chapter 7. 


5.1 Elementary Properties of the Multinormal 


Let us first summarize some properties which were already derived in the previous chapter. 
• The pdf of $X \sim N_p(\mu, \Sigma)$ is

$$f(x) = |2\pi\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\}. \tag{5.1}$$

The expectation is $E(X) = \mu$, the covariance can be calculated as $\operatorname{Var}(X) = E(X - \mu)(X - \mu)^\top = \Sigma$.

• Linear transformations turn normal random variables into normal random variables. If $X \sim N_p(\mu, \Sigma)$ and $A(p \times p)$, $c \in \mathbb{R}^p$, then $Y = AX + c$ is p-variate Normal, i.e.,

$$Y \sim N_p(A\mu + c, A\Sigma A^\top). \tag{5.2}$$

• If $X \sim N_p(\mu, \Sigma)$, then the Mahalanobis transformation is

$$Y = \Sigma^{-1/2}(X - \mu) \sim N_p(0, \mathcal{I}_p) \tag{5.3}$$

and it holds that

$$Y^\top Y = (X - \mu)^\top \Sigma^{-1} (X - \mu) \sim \chi^2_p. \tag{5.4}$$


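Often it is interesting to partition X into sub-vectors $X_1$ and $X_2$. The following theorem tells us how to correct $X_2$ to obtain a vector which is independent of $X_1$.

THEOREM 5.1 Let $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_p(\mu, \Sigma)$, $X_1 \in \mathbb{R}^r$, $X_2 \in \mathbb{R}^{p-r}$. Define $X_{2.1} = X_2 - \Sigma_{21}\Sigma_{11}^{-1} X_1$ from the partitioned covariance matrix

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$

Then

$$X_1 \sim N_r(\mu_1, \Sigma_{11}), \tag{5.5}$$
$$X_{2.1} \sim N_{p-r}(\mu_{2.1}, \Sigma_{22.1}) \tag{5.6}$$

are independent with

$$\mu_{2.1} = \mu_2 - \Sigma_{21}\Sigma_{11}^{-1}\mu_1, \quad \Sigma_{22.1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}. \tag{5.7}$$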


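Proof:

$$X_1 = AX \quad \text{with} \quad A = [\,\mathcal{I}_r,\ 0\,],$$
$$X_{2.1} = BX \quad \text{with} \quad B = [\,-\Sigma_{21}\Sigma_{11}^{-1},\ \mathcal{I}_{p-r}\,].$$

Then, by (5.2) $X_1$ and $X_{2.1}$ are both normal. Note that

$$\operatorname{Cov}(X_1, X_{2.1}) = A\Sigma B^\top = \begin{pmatrix} \mathcal{I} & 0 \end{pmatrix} \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \begin{pmatrix} (-\Sigma_{21}\Sigma_{11}^{-1})^\top \\ \mathcal{I} \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \end{pmatrix} \begin{pmatrix} (-\Sigma_{21}\Sigma_{11}^{-1})^\top \\ \mathcal{I} \end{pmatrix} = -\Sigma_{11}(\Sigma_{21}\Sigma_{11}^{-1})^\top + \Sigma_{12}.$$

Recall that $\Sigma_{21} = (\Sigma_{12})^\top$. Hence $A\Sigma B^\top = -\Sigma_{11}\Sigma_{11}^{-1}\Sigma_{12} + \Sigma_{12} \equiv 0$.

Using (5.2) again we also have the joint distribution of $(X_1, X_{2.1})$, namely

$$\begin{pmatrix} X_1 \\ X_{2.1} \end{pmatrix} = \begin{pmatrix} A \\ B \end{pmatrix} X \sim N_p\left( \begin{pmatrix} \mu_1 \\ \mu_{2.1} \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22.1} \end{pmatrix} \right).$$

With this block diagonal structure of the covariance matrix, the joint pdf of $(X_1, X_{2.1})$ can easily be factorized into

$$f(x_1, x_{2.1}) = |2\pi\Sigma_{11}|^{-1/2} \exp\left\{ -\tfrac{1}{2}(x_1 - \mu_1)^\top \Sigma_{11}^{-1} (x_1 - \mu_1) \right\} \times |2\pi\Sigma_{22.1}|^{-1/2} \exp\left\{ -\tfrac{1}{2}(x_{2.1} - \mu_{2.1})^\top \Sigma_{22.1}^{-1} (x_{2.1} - \mu_{2.1}) \right\}$$

from which the independence between $X_1$ and $X_{2.1}$ follows. □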


The next two corollaries are direct consequences of Theorem 5.1. 


COROLLARY 5.1 Let $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_p(\mu, \Sigma)$, $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$. $\Sigma_{12} = 0$ if and only if $X_1$ is independent of $X_2$.


The independence of two linear transforms of a multinormal X can be shown via the following corollary.


COROLLARY 5.2 If $X \sim N_p(\mu, \Sigma)$ and given some matrices A and B, then AX and BX are independent if and only if $A\Sigma B^\top = 0$.


The following theorem is also useful. It generalizes Theorem 4.6. The proof is left as an 
exercise. 


THEOREM 5.2 If $X \sim N_p(\mu, \Sigma)$, $A(q \times p)$, $c \in \mathbb{R}^q$ and $q \le p$, then $Y = AX + c$ is a q-variate Normal, i.e.,

$$Y \sim N_q(A\mu + c, A\Sigma A^\top).$$

The conditional distribution of $X_2$ given $X_1$ is given by the next theorem.


THEOREM 5.3 The conditional distribution of $X_2$ given $X_1 = x_1$ is normal with mean $\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1)$ and covariance $\Sigma_{22.1}$, i.e.,

$$(X_2 \mid X_1 = x_1) \sim N_{p-r}(\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1), \Sigma_{22.1}). \tag{5.8}$$

Proof:
Since $X_2 = X_{2.1} + \Sigma_{21}\Sigma_{11}^{-1} X_1$, for a fixed value of $X_1 = x_1$, $X_2$ is equivalent to $X_{2.1}$ plus a constant term:

$$(X_2 \mid X_1 = x_1) = (X_{2.1} + \Sigma_{21}\Sigma_{11}^{-1} x_1),$$

which has the normal distribution $N(\mu_{2.1} + \Sigma_{21}\Sigma_{11}^{-1} x_1, \Sigma_{22.1})$. □


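Note that the conditional mean of $(X_2 \mid X_1)$ is a linear function of $X_1$ and that the conditional variance does not depend on the particular value of $X_1$. In the following example we consider a specific distribution.


EXAMPLE 5.1 Suppose that $p = 2$, $r = 1$, $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} 1 & -0.8 \\ -0.8 & 2 \end{pmatrix}$. Then $\Sigma_{11} = 1$, $\Sigma_{21} = -0.8$ and $\Sigma_{22.1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} = 2 - (0.8)^2 = 1.36$. Hence the marginal pdf of $X_1$ is

$$f_{X_1}(x_1) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x_1^2}{2} \right)$$

and the conditional pdf of $(X_2 \mid X_1 = x_1)$ is given by

$$f(x_2 \mid x_1) = \frac{1}{\sqrt{2\pi (1.36)}} \exp\left\{ -\frac{(x_2 + 0.8 x_1)^2}{2 \times (1.36)} \right\}.$$

As mentioned above, the conditional mean of $(X_2 \mid X_1)$ is linear in $X_1$. The shift in the density of $(X_2 \mid X_1)$ can be seen in Figure 5.1.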
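The conditional moments of Theorem 5.3 are easy to evaluate in the bivariate case. The sketch below is a plain Python illustration for the covariance matrix of Example 5.1; the function name `conditional_normal_2d` is illustrative, and the mean vector is assumed to be zero.

```python
def conditional_normal_2d(mu, sigma, x1):
    """Mean and variance of (X2 | X1 = x1) for a bivariate normal."""
    s11, s12 = sigma[0]
    s21, s22 = sigma[1]
    slope = s21 / s11                      # Sigma_21 Sigma_11^{-1}
    mean = mu[1] + slope * (x1 - mu[0])    # linear in x1
    var = s22 - s21 * s12 / s11            # Sigma_22.1, free of x1
    return mean, var


mu = (0.0, 0.0)                            # assumed mean vector
sigma = ((1.0, -0.8), (-0.8, 2.0))
for x1 in (-1.0, 0.0, 1.0):
    print(x1, conditional_normal_2d(mu, sigma, x1))
```

Changing `x1` only shifts the conditional mean along the line $-0.8\,x_1$; the conditional variance stays at 1.36, which is the shift pattern the figure illustrates.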


Sometimes it will be useful to reconstruct a joint distribution from the marginal distribution 
of X, and the conditional distribution (X2|X,). The following theorem shows under which 
conditions this can be easily done in the multinormal framework. 


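THEOREM 5.4 If $X_1 \sim N_r(\mu_1, \Sigma_{11})$ and $(X_2 \mid X_1 = x_1) \sim N_{p-r}(\mathcal{A}x_1 + b, \Omega)$ where $\Omega$ does not depend on $x_1$, then $X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_p(\mu, \Sigma)$, where

$$\mu = \begin{pmatrix} \mu_1 \\ \mathcal{A}\mu_1 + b \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{11}\mathcal{A}^\top \\ \mathcal{A}\Sigma_{11} & \Omega + \mathcal{A}\Sigma_{11}\mathcal{A}^\top \end{pmatrix}.$$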

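Figure 5.1. Shifts in the conditional density (conditional normal densities $f(x_2 \mid x_1)$). MVAcondnorm.xpl


EXAMPLE 5.2 Consider the following random variables

$$X_1 \sim N_1(0, 1),$$
$$X_2 \mid X_1 = x_1 \sim N_2\left( \begin{pmatrix} 2x_1 \\ x_1 + 1 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right).$$

Using Theorem 5.4, where $\mathcal{A} = (2\ \ 1)^\top$, $b = (0\ \ 1)^\top$ and $\Omega = \mathcal{I}_2$, we easily obtain the following result:

$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_3\left( \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 & 2 & 1 \\ 2 & 5 & 2 \\ 1 & 2 & 2 \end{pmatrix} \right).$$

In particular, the marginal distribution of $X_2$ is

$$X_2 \sim N_2\left( \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 5 & 2 \\ 2 & 2 \end{pmatrix} \right),$$

thus conditional on $X_1$, the two components of $X_2$ are independent but marginally they are not!


Note that the marginal mean vector and covariance matrix of $X_2$ could have also been computed directly by using (4.28)-(4.29). Using the derivation above, however, provides us with useful properties: we have multinormality!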
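The block formulas of Theorem 5.4 can be traced numerically for a scalar $X_1$, as in Example 5.2. The following plain Python sketch assembles the joint mean and covariance; the function name `joint_from_conditional` and its argument names are illustrative, and the restriction to r = 1 is a simplification made here.

```python
def joint_from_conditional(mu1, s11, a, b, omega):
    """Joint mean and covariance of (X1, X2) for scalar X1.

    X1 ~ N(mu1, s11) and (X2 | X1 = x1) ~ N(a * x1 + b, omega).
    """
    k = len(a)
    mean = [mu1] + [a[i] * mu1 + b[i] for i in range(k)]
    # first row: Sigma_11 and Sigma_11 A'
    cov = [[s11] + [s11 * a[j] for j in range(k)]]
    # remaining rows: A Sigma_11 and Omega + A Sigma_11 A'
    for i in range(k):
        row = [a[i] * s11]
        row += [omega[i][j] + a[i] * s11 * a[j] for j in range(k)]
        cov.append(row)
    return mean, cov


mean, cov = joint_from_conditional(
    0.0, 1.0, [2.0, 1.0], [0.0, 1.0], [[1.0, 0.0], [0.0, 1.0]]
)
print(mean)
print(cov)
```

The lower right 2 × 2 block of `cov` is the marginal covariance of $X_2$; its nonzero off-diagonal entry shows the marginal dependence that is absent conditionally on $X_1$.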
Conditional Approximations


As we saw in Chapter 4 (Theorem 4.3), the conditional expectation $E(X_2 \mid X_1)$ is the mean squared error (MSE) best approximation of $X_2$ by a function of $X_1$. We have in this case that

$$X_2 = E(X_2 \mid X_1) + U = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(X_1 - \mu_1) + U. \tag{5.9}$$

Hence, the best approximation of $X_2 \in \mathbb{R}^{p-r}$ by $X_1 \in \mathbb{R}^r$ is the linear approximation that can be written as:

$$X_2 = \beta_0 + \mathcal{B} X_1 + U, \tag{5.10}$$

with $\mathcal{B} = \Sigma_{21}\Sigma_{11}^{-1}$, $\beta_0 = \mu_2 - \mathcal{B}\mu_1$ and $U \sim N(0, \Sigma_{22.1})$.


Consider now the particular case where $r = p - 1$. Now $X_2 \in \mathbb{R}$ and $\mathcal{B}$ is a row vector $\beta^\top$ of dimension $(1 \times r)$:

$$X_2 = \beta_0 + \beta^\top X_1 + U. \tag{5.11}$$

This means, geometrically speaking, that the best MSE approximation of $X_2$ by a function of $X_1$ is a hyperplane. The marginal variance of $X_2$ can be decomposed via (5.11):

$$\sigma_{22} = \beta^\top \Sigma_{11} \beta + \sigma_{22.1} = \sigma_{21}\Sigma_{11}^{-1}\sigma_{12} + \sigma_{22.1}. \tag{5.12}$$

The ratio

$$\rho^2_{2.1 \ldots r} = \frac{\sigma_{21}\Sigma_{11}^{-1}\sigma_{12}}{\sigma_{22}} \tag{5.13}$$

is known as the square of the multiple correlation between $X_2$ and the r variables $X_1$. It is the percentage of the variance of $X_2$ which is explained by the linear approximation $\beta_0 + \beta^\top X_1$. The last term in (5.12) is the residual variance of $X_2$. The square of the multiple correlation corresponds to the coefficient of determination introduced in Section 3.4, see (3.39), but here it is defined in terms of the r.v. $X_1$ and $X_2$. It can be shown that $\rho_{2.1 \ldots r}$ is also the maximum correlation attainable between $X_2$ and a linear combination of the elements of $X_1$, the optimal linear combination being precisely given by $\beta^\top X_1$. Note that when $r = 1$, the multiple correlation $\rho_{2.1}$ coincides with the usual simple correlation $\rho_{X_2 X_1}$ between $X_2$ and $X_1$.


EXAMPLE 5.3 Consider the "classic blue" pullover example (Example 3.15) and suppose that $X_1$ (sales), $X_2$ (price), $X_3$ (advertisement) and $X_4$ (sales assistants) are normally distributed with

$$\mu = \begin{pmatrix} 172.7 \\ 104.6 \\ 104.0 \\ 93.8 \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} 1037.21 & & & \\ -80.02 & 219.84 & & \\ 1430.70 & 92.10 & 2624.00 & \\ 271.44 & -91.58 & 210.30 & 177.36 \end{pmatrix}.$$

(These are in fact the sample mean and the sample covariance matrix but in this example we pretend that they are the true parameter values.)

The conditional distribution of $X_1$ given $(X_2, X_3, X_4)$ is thus a univariate normal with mean

$$\mu_1 + \sigma_{12}\Sigma_{22}^{-1} \begin{pmatrix} X_2 - \mu_2 \\ X_3 - \mu_3 \\ X_4 - \mu_4 \end{pmatrix} = 65.670 - 0.216 X_2 + 0.485 X_3 + 0.844 X_4$$

and variance

$$\sigma_{11.2} = \sigma_{11} - \sigma_{12}\Sigma_{22}^{-1}\sigma_{21} = 96.761.$$

The linear approximation of the sales ($X_1$) by the price ($X_2$), advertisement ($X_3$) and sales assistants ($X_4$) is provided by the conditional mean above. (Note that this coincides with the results of Example 3.15 due to the particular choice of $\mu$ and $\Sigma$.) The quality of the approximation is given by the multiple correlation $\rho^2_{1.234} = \frac{\sigma_{12}\Sigma_{22}^{-1}\sigma_{21}}{\sigma_{11}} = 0.907$. (Note again that this coincides with the coefficient of determination $r^2$ found in Example 3.15.)

This example also illustrates the concept of partial correlation. The correlation matrix between the 4 variables is given by

$$P = \begin{pmatrix} 1 & -0.168 & 0.867 & 0.633 \\ -0.168 & 1 & 0.121 & -0.464 \\ 0.867 & 0.121 & 1 & 0.308 \\ 0.633 & -0.464 & 0.308 & 1 \end{pmatrix},$$

so that the correlation between $X_1$ (sales) and $X_2$ (price) is −0.168. We can compute the conditional distribution of $(X_1, X_2)$ given $(X_3, X_4)$, which is a bivariate normal with mean:

$$\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} \sigma_{13} & \sigma_{14} \\ \sigma_{23} & \sigma_{24} \end{pmatrix} \begin{pmatrix} \sigma_{33} & \sigma_{34} \\ \sigma_{43} & \sigma_{44} \end{pmatrix}^{-1} \begin{pmatrix} X_3 - \mu_3 \\ X_4 - \mu_4 \end{pmatrix} = \begin{pmatrix} 32.516 + 0.467 X_3 + 0.977 X_4 \\ 153.644 + 0.085 X_3 - 0.617 X_4 \end{pmatrix}$$

and covariance matrix:

$$\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} - \begin{pmatrix} \sigma_{13} & \sigma_{14} \\ \sigma_{23} & \sigma_{24} \end{pmatrix} \begin{pmatrix} \sigma_{33} & \sigma_{34} \\ \sigma_{43} & \sigma_{44} \end{pmatrix}^{-1} \begin{pmatrix} \sigma_{31} & \sigma_{32} \\ \sigma_{41} & \sigma_{42} \end{pmatrix} = \begin{pmatrix} 104.006 & \\ -33.574 & 155.592 \end{pmatrix}.$$

In particular, the last covariance matrix allows the partial correlation between $X_1$ and $X_2$ to be computed for a fixed level of $X_3$ and $X_4$:

$$\rho_{X_1 X_2 \mid X_3 X_4} = \frac{-33.574}{\sqrt{104.006 \times 155.592}} = -0.264,$$

so that in this particular example with a fixed level of advertisement and sales assistance, the negative correlation between price and sales is more important than the marginal one.
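The partial correlation of Example 5.3 can be recomputed from the covariance matrix with a few lines of plain Python. This is a sketch only: the function name `partial_cov_12_given_34` is illustrative, and the 2 × 2 inverse is written out explicitly to avoid external libraries.

```python
import math

# Covariance matrix of (sales, price, advertisement, sales assistants)
sigma = [
    [1037.21, -80.02, 1430.70, 271.44],
    [-80.02, 219.84, 92.10, -91.58],
    [1430.70, 92.10, 2624.00, 210.30],
    [271.44, -91.58, 210.30, 177.36],
]


def partial_cov_12_given_34(s):
    """Covariance of (X1, X2) given (X3, X4): S11 - S12 S22^{-1} S21."""
    a, b, c, d = s[2][2], s[2][3], s[3][2], s[3][3]
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    out = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            correction = sum(
                s[i][2 + k] * inv[k][l] * s[2 + l][j]
                for k in range(2) for l in range(2)
            )
            out[i][j] = s[i][j] - correction
    return out


cond = partial_cov_12_given_34(sigma)
partial_corr = cond[0][1] / math.sqrt(cond[0][0] * cond[1][1])
marginal_corr = sigma[0][1] / math.sqrt(sigma[0][0] * sigma[1][1])
print(cond)
print(partial_corr, marginal_corr)
```

Comparing `partial_corr` with `marginal_corr` shows the effect discussed in the example: once advertisement and sales assistants are held fixed, the negative association between price and sales becomes stronger.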
Summary

- If $X \sim N_p(\mu, \Sigma)$, then a linear transformation $AX + c$, $A(q \times p)$, where $c \in \mathbb{R}^q$, has distribution $N_q(A\mu + c, A\Sigma A^\top)$.

- Two linear transformations AX and BX with $X \sim N_p(\mu, \Sigma)$ are independent if and only if $A\Sigma B^\top = 0$.

- If $X_1$ and $X_2$ are partitions of $X \sim N_p(\mu, \Sigma)$, then the conditional distribution of $X_2$ given $X_1 = x_1$ is again normal.

- In the multivariate normal case, $X_1$ is independent of $X_2$ if and only if $\Sigma_{12} = 0$.

- The conditional expectation of $(X_2 \mid X_1)$ is a linear function if $\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_p(\mu, \Sigma)$.

- The multiple correlation coefficient is defined as $\rho^2_{2.1 \ldots r} = \frac{\sigma_{21}\Sigma_{11}^{-1}\sigma_{12}}{\sigma_{22}}$.

- The multiple correlation coefficient is the percentage of the variance of $X_2$ explained by the linear approximation $\beta_0 + \beta^\top X_1$.


5.2 The Wishart Distribution 


The Wishart distribution (named after its discoverer) plays a prominent role in the analysis of estimated covariance matrices. If the mean of $X \sim N_p(\mu, \Sigma)$ is known to be $\mu = 0$, then for a data matrix $\mathcal{X}(n \times p)$ the estimated covariance matrix is proportional to $\mathcal{X}^\top \mathcal{X}$. This is the point where the Wishart distribution comes in, because $\mathcal{M}(p \times p) = \mathcal{X}^\top \mathcal{X} = \sum_{i=1}^{n} x_i x_i^\top$ has a Wishart distribution $W_p(\Sigma, n)$.


EXAMPLE 5.4 Set $p = 1$, then for $X \sim N_1(0, \sigma^2)$ the data matrix of the observations

$$\mathcal{X} = (x_1, \ldots, x_n)^\top \quad \text{with} \quad \mathcal{M} = \mathcal{X}^\top \mathcal{X} = \sum_{i=1}^{n} x_i x_i$$

leads to the Wishart distribution $W_1(\sigma^2, n) = \sigma^2 \chi^2_n$. The one-dimensional Wishart distribution is thus in fact a $\chi^2$ distribution.
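The one-dimensional case lends itself to a quick simulation check. The sketch below, in plain Python, draws $\mathcal{M} = \sum x_i^2$ repeatedly and compares its empirical mean and variance with $n\sigma^2$ and $2n\sigma^4$, the moments of a $\sigma^2\chi^2_n$ variable (a standard fact about the $\chi^2$ distribution, not derived in the text). The values of `n`, `sigma2`, `reps` and the seed are arbitrary choices.

```python
import random


def wishart_1d(n, sigma2, rng):
    """One draw of M = sum of n squared N(0, sigma2) observations."""
    sd = sigma2 ** 0.5
    return sum(rng.gauss(0.0, sd) ** 2 for _ in range(n))


rng = random.Random(42)
n, sigma2, reps = 5, 4.0, 20000
draws = [wishart_1d(n, sigma2, rng) for _ in range(reps)]
mean_m = sum(draws) / reps
var_m = sum((d - mean_m) ** 2 for d in draws) / reps

# compare with n * sigma2 = 20 and 2 * n * sigma2 ** 2 = 160
print(mean_m, var_m)
```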


When we talk about the distribution of a matrix, we mean of course the joint distribution of all its elements. More exactly: since $\mathcal{M} = \mathcal{X}^\top \mathcal{X}$ is symmetric we only need to consider the elements of the lower triangular matrix

$$\mathcal{M} = \begin{pmatrix} m_{11} & & & \\ m_{21} & m_{22} & & \\ \vdots & \vdots & \ddots & \\ m_{p1} & m_{p2} & \cdots & m_{pp} \end{pmatrix}. \tag{5.14}$$

Hence the Wishart distribution is defined by the distribution of the vector

$$(m_{11}, \ldots, m_{p1}, m_{22}, \ldots, m_{p2}, \ldots, m_{pp})^\top. \tag{5.15}$$

Linear transformations of the data matrix $\mathcal{X}$ also lead to Wishart matrices.


THEOREM 5.5 If $\mathcal{M} \sim W_p(\Sigma, n)$ and $B(p \times q)$, then the distribution of $B^\top \mathcal{M} B$ is Wishart $W_q(B^\top \Sigma B, n)$.


With this theorem we can standardize Wishart matrices since with $B = \Sigma^{-1/2}$ the distribution of $\Sigma^{-1/2} \mathcal{M} \Sigma^{-1/2}$ is $W_p(\mathcal{I}, n)$. Another connection to the $\chi^2$-distribution is given by the following theorem.


THEOREM 5.6 If $\mathcal{M} \sim W_p(\Sigma, m)$, and $a \in \mathbb{R}^p$ with $a^\top \Sigma a \ne 0$, then the distribution of $\dfrac{a^\top \mathcal{M} a}{a^\top \Sigma a}$ is $\chi^2_m$.


This theorem is an immediate consequence of Theorem 5.5 if we apply the linear transformation $x \mapsto a^\top x$. Central to the analysis of covariance matrices is the next theorem.


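THEOREM 5.7 (Cochran) Let $\mathcal{X}(n \times p)$ be a data matrix from a $N_p(0, \Sigma)$ distribution and let $\mathcal{C}(n \times n)$ be a symmetric matrix.

(a) $\mathcal{X}^\top \mathcal{C} \mathcal{X}$ has the distribution of weighted Wishart random variables, i.e.

$$\mathcal{X}^\top \mathcal{C} \mathcal{X} = \sum_{i=1}^{n} \lambda_i W_p(\Sigma, 1),$$

where $\lambda_i$, $i = 1, \ldots, n$, are the eigenvalues of $\mathcal{C}$.

(b) $\mathcal{X}^\top \mathcal{C} \mathcal{X}$ is Wishart if and only if $\mathcal{C}^2 = \mathcal{C}$. In this case

$$\mathcal{X}^\top \mathcal{C} \mathcal{X} \sim W_p(\Sigma, r),$$

and $r = \operatorname{rank}(\mathcal{C}) = \operatorname{tr}(\mathcal{C})$.

(c) $n\mathcal{S} = \mathcal{X}^\top \mathcal{H} \mathcal{X}$ is distributed as $W_p(\Sigma, n - 1)$ (note that $\mathcal{S}$ is the sample covariance matrix).

(d) $\bar x$ and $\mathcal{S}$ are independent.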


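The following properties are useful:

1. If $\mathcal{M} \sim W_p(\Sigma, n)$, then $E(\mathcal{M}) = n\Sigma$.

2. If $\mathcal{M}_i$ are independent Wishart $W_p(\Sigma, n_i)$, $i = 1, \ldots, k$, then $\mathcal{M} = \sum_{i=1}^{k} \mathcal{M}_i \sim W_p(\Sigma, n)$ where $n = \sum_{i=1}^{k} n_i$.

3. The density of $W_p(\Sigma, n - 1)$ for a positive definite $\mathcal{M}$ is given by:

$$f_{\Sigma, n-1}(\mathcal{M}) = \frac{|\mathcal{M}|^{\frac{1}{2}(n-p-2)}\, e^{-\frac{1}{2}\operatorname{tr}(\mathcal{M}\Sigma^{-1})}}{2^{\frac{1}{2}p(n-1)}\, \pi^{\frac{1}{4}p(p-1)}\, |\Sigma|^{\frac{1}{2}(n-1)} \prod_{i=1}^{p} \Gamma\left\{\frac{n-i}{2}\right\}}, \tag{5.16}$$

where $\Gamma$ is the gamma function, see Feller (1966).


For further details on the Wishart distribution see Mardia, Kent and Bibby (1979).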


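Summary

- The Wishart distribution is a generalization of the $\chi^2$-distribution. In particular $W_1(\sigma^2, n) = \sigma^2 \chi^2_n$.

- The empirical covariance matrix $\mathcal{S}$ has a $\frac{1}{n} W_p(\Sigma, n - 1)$ distribution.

- In the normal case, $\bar x$ and $\mathcal{S}$ are independent.

- For $\mathcal{M} \sim W_p(\Sigma, m)$, $\dfrac{a^\top \mathcal{M} a}{a^\top \Sigma a} \sim \chi^2_m$.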
5.3 Hotelling $T^2$-Distribution


Suppose that $Y \in \mathbb{R}^p$ is a standard normal random vector, i.e., $Y \sim N_p(0, \mathcal{I})$, independent of the random matrix $\mathcal{M} \sim W_p(\mathcal{I}, n)$. What is the distribution of $Y^\top \mathcal{M}^{-1} Y$? The answer is provided by the Hotelling $T^2$-distribution: $n\, Y^\top \mathcal{M}^{-1} Y$ is Hotelling $T^2(p, n)$ distributed.


The Hotelling $T^2$-distribution is a generalization of the Student t-distribution. The general multinormal distribution $N(\mu, \Sigma)$ is considered in Theorem 5.8. The Hotelling $T^2$-distribution will play a central role in hypothesis testing in Chapter 7.


THEOREM 5.8 If $X \sim N_p(\mu, \Sigma)$ is independent of $\mathcal{M} \sim W_p(\Sigma, n)$, then

$$n (X - \mu)^\top \mathcal{M}^{-1} (X - \mu) \sim T^2(p, n).$$


COROLLARY 5.3 If $\bar x$ is the mean of a sample drawn from a normal population $N_p(\mu, \Sigma)$ and $\mathcal{S}$ is the sample covariance matrix, then

$$(n - 1)(\bar x - \mu)^\top \mathcal{S}^{-1} (\bar x - \mu) = n (\bar x - \mu)^\top \mathcal{S}_u^{-1} (\bar x - \mu) \sim T^2(p, n - 1). \tag{5.17}$$


Recall that $\mathcal{S}_u = \frac{n}{n-1} \mathcal{S}$ is an unbiased estimator of the covariance matrix. A connection between the Hotelling $T^2$- and the F-distribution is given by the next theorem.


THEOREM 5.9

$$T^2(p, n) = \frac{np}{n - p + 1} F_{p,\, n-p+1}.$$

EXAMPLE 5.5 In the univariate case (p = 1), this theorem boils down to the well known result:

$$\left( \frac{\bar x - \mu}{\sqrt{\mathcal{S}_u}/\sqrt{n}} \right)^2 \sim T^2(1, n - 1) = F_{1, n-1} = t^2_{n-1}.$$


For further details on the Hotelling $T^2$-distribution see Mardia et al. (1979). The next corollary follows immediately from (3.23), (3.24) and from Theorem 5.8. It will be useful for testing linear restrictions in multinormal populations.


COROLLARY 5.4 Consider a linear transform of $X \sim N_p(\mu, \Sigma)$, $Y = AX$ where $A(q \times p)$ with $(q \le p)$. If $\bar x$ and $\mathcal{S}_X$ are the sample mean and the covariance matrix, we have

$$\bar y = A\bar x \sim N_q\left(A\mu, \tfrac{1}{n} A\Sigma A^\top\right)$$
$$n\mathcal{S}_Y = nA\mathcal{S}_X A^\top \sim W_q(A\Sigma A^\top, n - 1)$$
$$(n - 1)(A\bar x - A\mu)^\top (A\mathcal{S}_X A^\top)^{-1} (A\bar x - A\mu) \sim T^2(q, n - 1)$$
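The $T^2$ distribution is closely connected to the univariate t-statistic. In Example 5.4 we described the manner in which the Wishart distribution generalizes the $\chi^2$-distribution. We can write (5.17) as:

$$T^2 = \sqrt{n}(\bar x - \mu)^\top \left( \frac{\sum_{j=1}^{n} (x_j - \bar x)(x_j - \bar x)^\top}{n - 1} \right)^{-1} \sqrt{n}(\bar x - \mu)$$

which is of the form

$$\begin{pmatrix} \text{multivariate normal} \\ \text{random vector} \end{pmatrix}^\top \begin{pmatrix} \dfrac{\text{Wishart random matrix}}{\text{degrees of freedom}} \end{pmatrix}^{-1} \begin{pmatrix} \text{multivariate normal} \\ \text{random vector} \end{pmatrix}.$$

This is analogous to

$$t^2 = \sqrt{n}(\bar x - \mu)(s^2)^{-1} \sqrt{n}(\bar x - \mu)$$

or

$$\begin{pmatrix} \text{normal} \\ \text{random variable} \end{pmatrix} \begin{pmatrix} \dfrac{\chi^2\text{-random variable}}{\text{degrees of freedom}} \end{pmatrix}^{-1} \begin{pmatrix} \text{normal} \\ \text{random variable} \end{pmatrix}$$

for the univariate case. Since the multivariate normal and Wishart random variables are independently distributed, their joint distribution is the product of the marginal normal and Wishart distributions. Using calculus, the distribution of $T^2$ as given above can be derived from this joint distribution.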


Summary

- Hotelling's $T^2$-distribution is a generalization of the t-distribution. In particular $T(1, n) = t_n$.

- $(n - 1)(\bar x - \mu)^\top \mathcal{S}^{-1} (\bar x - \mu)$ has a $T^2(p, n - 1)$ distribution.

- The relation between Hotelling's $T^2$- and Fisher's F-distribution is given by $T^2(p, n) = \frac{np}{n-p+1} F_{p,\, n-p+1}$.
5.4 Spherical and Elliptical Distributions


The multinormal distribution belongs to the large family of elliptical distributions which has recently gained a lot of attention in financial mathematics. Elliptical distributions are often used, particularly in risk management.


DEFINITION 5.1 A $(p \times 1)$ random vector Y is said to have a spherical distribution $S_p(\phi)$ if its characteristic function $\psi_Y(t)$ satisfies: $\psi_Y(t) = \phi(t^\top t)$ for some scalar function $\phi(\cdot)$ which is then called the characteristic generator of the spherical distribution $S_p(\phi)$. We will write $Y \sim S_p(\phi)$.


This is only one of several possible ways to define spherical distributions. We can see spherical distributions as an extension of the standard multinormal distribution $N_p(0, \mathcal{I}_p)$.


THEOREM 5.10 Spherical random variables have the following properties:

1. All marginal distributions of a spherical distributed random vector are spherical.

2. All the marginal characteristic functions have the same generator.

3. Let $X \sim S_p(\phi)$, then X has the same distribution as $r u^{(p)}$ where $u^{(p)}$ is a random vector distributed uniformly on the unit sphere surface in $\mathbb{R}^p$ and $r \ge 0$ is a random variable independent of $u^{(p)}$. If $E(r^2) < \infty$, then

$$E(X) = 0, \quad \operatorname{Cov}(X) = \frac{E(r^2)}{p} \mathcal{I}_p.$$


The random radius r is related to the generator $\phi$ by a relation described in Fang, Kotz and Ng (1990, p.29). The moments of $X \sim S_p(\phi)$, provided that they exist, can be expressed in terms of one-dimensional integrals (Fang et al., 1990).
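The stochastic representation in item 3 of Theorem 5.10 can be illustrated by simulation. In the plain Python sketch below, a uniform direction on the unit sphere is obtained by normalizing a standard normal vector (a standard construction, not taken from the text), and the radius is drawn from a uniform distribution on (0, 2), which is a purely hypothetical choice with $E(r^2) = 4/3$.

```python
import math
import random


def uniform_on_sphere(p, rng):
    """Uniform point on the unit sphere in R^p (normalized Gaussian vector)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]


rng = random.Random(3)
p, reps = 3, 20000
sum_sq = [0.0] * p
sum_r2 = 0.0
for _ in range(reps):
    r = rng.uniform(0.0, 2.0)   # hypothetical radius law, E(r^2) = 4/3
    u = uniform_on_sphere(p, rng)
    x = [r * v for v in u]      # X = r * u
    sum_r2 += r * r
    for i in range(p):
        sum_sq[i] += x[i] * x[i]

variances = [s / reps for s in sum_sq]  # E(X) = 0, so these estimate Var(X_i)
target = (sum_r2 / reps) / p            # empirical E(r^2) / p
print(variances, target)
```

Each component variance should be close to $E(r^2)/p$, in line with $\operatorname{Cov}(X) = \frac{E(r^2)}{p}\mathcal{I}_p$.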


A spherically distributed random vector does not, in general, necessarily possess a density. However, if it does, the marginal densities of dimension smaller than $p - 1$ are continuous and the marginal densities of dimension smaller than $p - 2$ are differentiable (except possibly at the origin in both cases). Univariate marginal densities for p greater than 2 are nondecreasing on $(-\infty, 0)$ and nonincreasing on $(0, \infty)$.


DEFINITION 5.2 A $(p \times 1)$ random vector X is said to have an elliptical distribution with parameters $\mu(p \times 1)$ and $\Sigma(p \times p)$ if X has the same distribution as $\mu + A^\top Y$, where $Y \sim S_k(\phi)$ and A is a $(k \times p)$ matrix such that $A^\top A = \Sigma$ with $\operatorname{rank}(\Sigma) = k$. We shall write $X \sim EC_p(\mu, \Sigma, \phi)$.


REMARK 5.1 The elliptical distribution can be seen as an extension of $N_p(\mu, \Sigma)$.


EXAMPLE 5.6 The multivariate t-distribution. Let $Z \sim N_p(0, \mathcal{I}_p)$ and $s \sim \chi^2_m$ be independent. The random vector

$$Y = \sqrt{m}\, \frac{Z}{\sqrt{s}}$$

has a multivariate t-distribution with m degrees of freedom. Moreover the t-distribution belongs to the family of p-dimensional spherical distributions.


EXAMPLE 5.7 The multinormal distribution. Let $X \sim N_p(\mu, \Sigma)$. Then $X \sim EC_p(\mu, \Sigma, \phi)$ and $\phi(u) = \exp(-u/2)$. Figure 4.3 shows a density surface of the multivariate normal distribution: $f(x) = \det(2\pi\Sigma)^{-1/2} \exp\{-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\}$ with $\Sigma = \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix}$ and $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$. Notice that the density is constant on ellipses. This is the reason for calling this family of distributions "elliptical".


THEOREM 5.11 Elliptical random vectors X have the following properties:

1. Any linear combination of elliptically distributed variables is elliptical.

2. Marginal distributions of elliptically distributed variables are elliptical.

3. A scalar function $\phi(\cdot)$ can determine an elliptical distribution $EC_p(\mu, \Sigma, \phi)$ for every $\mu \in \mathbb{R}^p$ and $\Sigma \ge 0$ with $\operatorname{rank}(\Sigma) = k$ iff $\phi(t^\top t)$ is a p-dimensional characteristic function.

4. Assume that X is nondegenerate. If $X \sim EC_p(\mu, \Sigma, \phi)$ and $X \sim EC_p(\mu^*, \Sigma^*, \phi^*)$, then there exists a constant $c > 0$ such that

$$\mu = \mu^*, \quad \Sigma = c\Sigma^*, \quad \phi^*(\cdot) = \phi(c^{-1}\,\cdot).$$

In other words $\Sigma$, $\phi$, A are not unique, unless we impose the condition that $\det(\Sigma) = 1$.

5. The characteristic function of X, $\psi(t) = E(e^{i t^\top X})$, is of the form

$$\psi(t) = e^{i t^\top \mu} \phi(t^\top \Sigma t)$$

for a scalar function $\phi$.

6. $X \sim EC_p(\mu, \Sigma, \phi)$ with $\operatorname{rank}(\Sigma) = k$ iff X has the same distribution as:

$$\mu + r A^\top u^{(k)} \tag{5.18}$$

where $r \ge 0$ is independent of $u^{(k)}$ which is a random vector distributed uniformly on the unit sphere surface in $\mathbb{R}^k$ and A is a $(k \times p)$ matrix such that $A^\top A = \Sigma$.

7. Assume that $X \sim EC_p(\mu, \Sigma, \phi)$ and $E(r^2) < \infty$. Then

$$E(X) = \mu, \quad \operatorname{Cov}(X) = \frac{E(r^2)}{\operatorname{rank}(\Sigma)} \Sigma = -2\phi'(0)\Sigma.$$

8. Assume that $X \sim EC_p(\mu, \Sigma, \phi)$ with $\operatorname{rank}(\Sigma) = k$. Then

$$Q(X) = (X - \mu)^\top \Sigma^{-} (X - \mu)$$

has the same distribution as $r^2$ in equation (5.18).


5.5 Exercises 


1 0 


EXERCISE 5.1 Consider X ~ No(u,%) with wp = (2,2)' and 5 = € 1 


) and the matrices 


i ye 
A= ( ) ,o= ( 1) . Show that AX and BX are independent. 


EXERCISE 5.2 Prove Theorem 5.4. 


EXERCISE 5.3 Prove proposition (c) of Theorem 5.7. 


e9(()4) 
risem((a8s)(04)) 


a) Determine the distribution of Y2 | Yi. 


EXERCISE 5.4 Let 


and 


b) Determine the distribution of W =X —Y. 


X 
EXERCISE 5.5 Consider | Y | ~ N3(y,%). Compute pp and % knowing that 
Z 


Y|Z ~ N,(-Z,1) 
bgy = -3s-—aY 
Rie @& HO 0r 4577), 
Determine the distributions of X | Y and of X | Y + Z. 
EXERCISE 5.6 Knowing that


Z i N,(0, 1) 
YZ ne NIG) 
Xo xe Mey 1) 


a) find the distribution of | Y ] and of Y | X,Z. 
Z 


b) find the distribution of 


c) compute E(Y | U = 2). 


EXERCISE 5.7 Suppose ( Y 


za ) ~ No(s, 4) with & positive definite. Is it possible that 


a) xy = 8Y?, 
b) oxxyy =24+Y°, 
c) Ux|y = 3 YY, and 


d) Oxxly =9 ? 


1 Lt. =6-. “2 
EXERCISE 5.8 Let X ~ N3 2),{ —6 10 —4 
3 2-4 6 


a) Find the best linear approximation of $X_3$ by a linear function of $X_1$ and $X_2$ and compute the multiple correlation between $X_3$ and $(X_1, X_2)$.


b) Let $Z_1 = X_2 - X_3$, $Z_2 = X_2 + X_3$ and $(Z_3 \mid Z_1, Z_2) \sim N_1(Z_1 + Z_2, 10)$. Compute the distribution of $\begin{pmatrix} Z_1 \\ Z_2 \\ Z_3 \end{pmatrix}$.
EXERCISE 5.9 Let $(X, Y, Z)^\top$ be a trivariate normal r.v. with

$$Y \mid Z \sim N_1(2Z, 24)$$
$$Z \mid X \sim N_1(2X + 3, 14)$$
$$X \sim N_1(1, 4)$$

and $\rho_{XY} = 0.5$.


Find the distribution of (X,Y,Z)' and compute the partial correlation between X and Y for 
fixed Z. Do you think it is reasonable to approximate X by a linear function of Y and Z? 


1 4124 
2 14 21 
EXERCISE 5.10 Let X ~ Ny 3 rl 9 9 46-4 
4 41 19 


a) Give the best linear approximation of $X_2$ as a function of $(X_1, X_4)$ and evaluate the quality of the approximation.


b) Give the best linear approximation of $X_2$ as a function of $(X_1, X_3, X_4)$ and compare your answer with part a).


EXERCISE 5.11 Prove Theorem 5.2. 
(Hint: complete the linear transformation Z = ea X + Ce and then use Theorem 


5.1 to get the marginal of the first q components of Z.) 


EXERCISE 5.12 Prove Corollaries 5.1 and 5.2. 


6 Theory of Estimation 


We know from our basic knowledge of statistics that one of the objectives in statistics is to 
better understand and model the underlying process which generates the data. This is known 
as statistical inference: we infer from information contained in a sample properties of the 
population from which the observations are taken. In multivariate statistical inference, we 
do exactly the same. The basic ideas were introduced in Section 4.5 on sampling theory: we 
observed the values of a multivariate random variable $X$ and obtained a sample $\mathcal{X} = \{x_i\}_{i=1}^n$.
Under random sampling, these observations are considered to be realizations of a sequence
of i.i.d. random variables $X_1, \ldots, X_n$, where each $X_i$ is a $p$-variate random variable which
replicates the parent or population random variable $X$. In this chapter, for notational
convenience, we will no longer differentiate between a random variable $X_i$ and an observation
of it, $x_i$, in our notation. We will simply write $x_i$ and it should be clear from the context
whether a random variable or an observed value is meant.


Statistical inference infers from the i.i.d. random sample $\mathcal{X}$ the properties of the population:
typically, some unknown characteristic $\theta$ of its distribution. In parametric statistics, $\theta$ is a
$k$-variate vector $\theta \in \mathbb{R}^k$ characterizing the unknown properties of the population pdf $f(x; \theta)$:
this could be the mean, the covariance matrix, kurtosis, etc.

The aim will be to estimate $\theta$ from the sample $\mathcal{X}$ through estimators $\hat\theta$ which are functions
of the sample: $\hat\theta = \hat\theta(\mathcal{X})$. When an estimator $\hat\theta$ is proposed, we must derive its sampling
distribution to analyze its properties (is it related to the unknown quantity $\theta$ it is supposed
to estimate?).


In this chapter the basic theoretical tools are developed which are needed to derive estima- 
tors and to determine their properties in general situations. We will basically rely on the 
maximum likelihood theory in our presentation. In many situations, the maximum likeli- 
hood estimators indeed share asymptotic optimal properties which make their use easy and 
appealing. 


We will illustrate the theory with the multivariate normal population and with the linear regression
model, where the applications are numerous and the derivations are easy to do. In multivariate
setups, the maximum likelihood estimator is at times too complicated to be derived analytically.
In such cases, the estimators are obtained using numerical methods (nonlinear
optimization). The general theory and the asymptotic properties of these estimators remain
simple and valid. The following chapter, Chapter 7, concentrates on hypothesis testing and
confidence interval issues.


6.1 The Likelihood Function 


Suppose that $\{x_i\}_{i=1}^n$ is an i.i.d. sample from a population with pdf $f(x; \theta)$. The aim is to
estimate $\theta \in \mathbb{R}^k$ which is a vector of unknown parameters. The likelihood function is defined
as the joint density $L(\mathcal{X}; \theta)$ of the observations $x_i$ considered as a function of $\theta$:

$$L(\mathcal{X}; \theta) = \prod_{i=1}^n f(x_i; \theta), \qquad (6.1)$$

where $\mathcal{X}$ denotes the sample of the data matrix with the observations $x_1^{\top}, \ldots, x_n^{\top}$ in each
row. The maximum likelihood estimator (MLE) of $\theta$ is defined as

$$\hat\theta = \arg\max_{\theta} L(\mathcal{X}; \theta).$$

Often it is easier to maximize the log-likelihood function

$$\ell(\mathcal{X}; \theta) = \log L(\mathcal{X}; \theta), \qquad (6.2)$$

which is equivalent since the logarithm is a monotone one-to-one function. Hence

$$\hat\theta = \arg\max_{\theta} L(\mathcal{X}; \theta) = \arg\max_{\theta} \ell(\mathcal{X}; \theta).$$


The following examples illustrate cases where the maximization process can be performed
analytically, i.e., we will obtain an explicit analytical expression for $\hat\theta$. Unfortunately, in other
situations, the maximization process can be more intricate, involving nonlinear optimization
techniques. In the latter case, given a sample $\mathcal{X}$ and the likelihood function, numerical
methods will be used to determine the value of $\theta$ maximizing $L(\mathcal{X}; \theta)$ or $\ell(\mathcal{X}; \theta)$. These
numerical methods are typically based on Newton-Raphson iterative techniques.
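
As a rough illustration of such an iterative scheme, the following Python sketch applies Newton-Raphson to a log-likelihood. The exponential model with rate `lam`, the function name `newton_raphson_mle`, the sample values and the starting point are all assumptions chosen for illustration (they do not come from the text); this model has the closed-form MLE 1/mean, which makes the result easy to check.

```python
def newton_raphson_mle(sample, start, tol=1e-12, max_iter=100):
    """Maximize l(lam) = n*log(lam) - lam*sum(x), the log-likelihood of an
    assumed exponential model, by Newton-Raphson iterations."""
    n = len(sample)
    total = sum(sample)
    lam = start
    for _ in range(max_iter):
        score = n / lam - total      # first derivative of l
        hessian = -n / lam ** 2      # second derivative of l
        step = score / hessian
        lam = lam - step             # Newton-Raphson update
        if abs(step) < tol:
            break
    return lam


sample = [0.5, 1.2, 0.3, 2.0, 1.0]   # illustrative data with mean 1.0
lam_hat = newton_raphson_mle(sample, start=0.5)
```

For this model the iteration only converges when the starting value lies between 0 and 2/mean, which is a reminder that Newton-type methods generally need a sensible starting point.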


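EXAMPLE 6.1 Consider a sample $\{x_i\}_{i=1}^n$ from $N_p(\mu, \mathcal{I})$, i.e., from the pdf

$$f(x; \theta) = (2\pi)^{-p/2} \exp\left\{ -\frac{1}{2} (x - \theta)^{\top} (x - \theta) \right\}$$

where $\theta = \mu \in \mathbb{R}^p$ is the mean vector parameter. The log-likelihood is in this case

$$\ell(\mathcal{X}; \theta) = \sum_{i=1}^n \log\{f(x_i; \theta)\} = \log (2\pi)^{-np/2} - \frac{1}{2} \sum_{i=1}^n (x_i - \theta)^{\top} (x_i - \theta). \qquad (6.3)$$

The term $(x_i - \theta)^{\top} (x_i - \theta)$ equals

$$(x_i - \bar{x})^{\top} (x_i - \bar{x}) + (\bar{x} - \theta)^{\top} (\bar{x} - \theta) + 2 (\bar{x} - \theta)^{\top} (x_i - \bar{x}).$$

Summing this term over $i = 1, \ldots, n$ we see that

$$\sum_{i=1}^n (x_i - \theta)^{\top} (x_i - \theta) = \sum_{i=1}^n (x_i - \bar{x})^{\top} (x_i - \bar{x}) + n (\bar{x} - \theta)^{\top} (\bar{x} - \theta).$$

Hence

$$\ell(\mathcal{X}; \theta) = \log(2\pi)^{-np/2} - \frac{1}{2} \sum_{i=1}^n (x_i - \bar{x})^{\top} (x_i - \bar{x}) - \frac{n}{2} (\bar{x} - \theta)^{\top} (\bar{x} - \theta).$$

Only the last term depends on $\theta$ and is obviously maximized for

$$\hat\theta = \hat\mu = \bar{x}.$$

Thus $\bar{x}$ is the MLE of $\theta$ for this family of pdfs $f(x; \theta)$.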
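
The result of Example 6.1 can be checked numerically. The sketch below evaluates the log-likelihood (6.3) for a small made-up sample and compares its value at the sample mean with its value at nearby points; the function names and the data are illustrative assumptions.

```python
import math


def log_likelihood(sample, theta):
    """Log-likelihood (6.3) of a sample from N_p(theta, I)."""
    n, p = len(sample), len(theta)
    quad = sum((x[j] - theta[j]) ** 2 for x in sample for j in range(p))
    return -0.5 * n * p * math.log(2 * math.pi) - 0.5 * quad


def sample_mean(sample):
    """Componentwise mean of a list of p-vectors."""
    n, p = len(sample), len(sample[0])
    return [sum(x[j] for x in sample) / n for j in range(p)]


sample = [(1.0, 2.0), (2.0, 0.0), (3.0, 4.0)]   # made-up bivariate data
xbar = sample_mean(sample)

# The log-likelihood at xbar exceeds its value at a perturbed parameter.
gap = log_likelihood(sample, xbar) - log_likelihood(sample, [2.1, 2.0])
```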


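A more complex example is the following one where we derive the MLE's for $\mu$ and $\Sigma$.

EXAMPLE 6.2 Suppose $\{x_i\}_{i=1}^n$ is a sample from a normal distribution $N_p(\mu, \Sigma)$. Here
$\theta = (\mu, \Sigma)$ with $\Sigma$ interpreted as a vector. Due to the symmetry of $\Sigma$ the unknown parameter
$\theta$ is in fact $\{p + \frac{1}{2} p(p + 1)\}$-dimensional. Then

$$L(\mathcal{X}; \theta) = |2\pi\Sigma|^{-n/2} \exp\left\{ -\frac{1}{2} \sum_{i=1}^n (x_i - \mu)^{\top} \Sigma^{-1} (x_i - \mu) \right\} \qquad (6.4)$$

and

$$\ell(\mathcal{X}; \theta) = -\frac{n}{2} \log |2\pi\Sigma| - \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^{\top} \Sigma^{-1} (x_i - \mu). \qquad (6.5)$$

The term $(x_i - \mu)^{\top} \Sigma^{-1} (x_i - \mu)$ equals

$$(x_i - \bar{x})^{\top} \Sigma^{-1} (x_i - \bar{x}) + (\bar{x} - \mu)^{\top} \Sigma^{-1} (\bar{x} - \mu) + 2 (\bar{x} - \mu)^{\top} \Sigma^{-1} (x_i - \bar{x}).$$

Summing this term over $i = 1, \ldots, n$ we see that

$$\sum_{i=1}^n (x_i - \mu)^{\top} \Sigma^{-1} (x_i - \mu) = \sum_{i=1}^n (x_i - \bar{x})^{\top} \Sigma^{-1} (x_i - \bar{x}) + n (\bar{x} - \mu)^{\top} \Sigma^{-1} (\bar{x} - \mu).$$

Note that from (2.14)

$$(x_i - \bar{x})^{\top} \Sigma^{-1} (x_i - \bar{x}) = \mathrm{tr}\{(x_i - \bar{x})^{\top} \Sigma^{-1} (x_i - \bar{x})\} = \mathrm{tr}\{\Sigma^{-1} (x_i - \bar{x})(x_i - \bar{x})^{\top}\}.$$

Therefore, by summing over the index $i$ we finally arrive at

$$\sum_{i=1}^n (x_i - \mu)^{\top} \Sigma^{-1} (x_i - \mu) = \mathrm{tr}\left\{ \Sigma^{-1} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^{\top} \right\} + n (\bar{x} - \mu)^{\top} \Sigma^{-1} (\bar{x} - \mu)$$
$$= \mathrm{tr}\{\Sigma^{-1} n \mathcal{S}\} + n (\bar{x} - \mu)^{\top} \Sigma^{-1} (\bar{x} - \mu).$$

Thus the log-likelihood function for $N_p(\mu, \Sigma)$ is

$$\ell(\mathcal{X}; \theta) = -\frac{n}{2} \log |2\pi\Sigma| - \frac{n}{2} \mathrm{tr}\{\Sigma^{-1} \mathcal{S}\} - \frac{n}{2} (\bar{x} - \mu)^{\top} \Sigma^{-1} (\bar{x} - \mu). \qquad (6.6)$$

We can easily see that the third term is maximized by $\mu = \bar{x}$. In fact the MLE's are given by

$$\hat\mu = \bar{x}, \qquad \hat\Sigma = \mathcal{S}.$$

The derivation of $\hat\Sigma$ is a lot more complicated. It involves derivatives with respect to matrices
with their notational complexities and will not be presented here: for a more elaborate proof
see Mardia et al. (1979, p.103-104). Note that the unbiased covariance estimator $\mathcal{S}_u = \frac{n}{n-1} \mathcal{S}$
is not the MLE of $\Sigma$!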
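
To make the difference between the two covariance estimators of Example 6.2 concrete, the following sketch computes the MLE (divisor n) and the unbiased estimator (divisor n - 1) for a tiny made-up sample. The function names and the data are illustrative assumptions.

```python
def covariance_mle(sample):
    """MLE S of the covariance matrix (divisor n) for a list of p-vectors."""
    n, p = len(sample), len(sample[0])
    xbar = [sum(x[j] for x in sample) / n for j in range(p)]
    return [[sum((x[j] - xbar[j]) * (x[k] - xbar[k]) for x in sample) / n
             for k in range(p)] for j in range(p)]


def covariance_unbiased(sample):
    """Unbiased estimator S_u = n/(n-1) * S."""
    n = len(sample)
    return [[n / (n - 1) * s for s in row] for row in covariance_mle(sample)]


sample = [(1.0, 2.0), (2.0, 0.0), (3.0, 4.0)]   # made-up data, n = 3
S = covariance_mle(sample)
S_u = covariance_unbiased(sample)
```

With n = 3 the two estimators differ by the factor 3/2, a gap that shrinks as n grows.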


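EXAMPLE 6.3 Consider the linear regression model $y_i = \beta^{\top} x_i + \varepsilon_i$ for $i = 1, \ldots, n$, where
$\varepsilon_i$ is i.i.d. and $N(0, \sigma^2)$ and where $x_i \in \mathbb{R}^p$. Here $\theta = (\beta^{\top}, \sigma)$ is a $(p + 1)$-dimensional
parameter vector. Denote

$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad \mathcal{X} = \begin{pmatrix} x_1^{\top} \\ \vdots \\ x_n^{\top} \end{pmatrix}.$$

Then

$$L(y, \mathcal{X}; \theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} (y_i - \beta^{\top} x_i)^2 \right\}$$

and

$$\ell(y, \mathcal{X}; \theta) = \log\left( \frac{1}{(2\pi)^{n/2} \sigma^n} \right) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta^{\top} x_i)^2$$
$$= -\frac{n}{2} \log(2\pi) - n \log \sigma - \frac{1}{2\sigma^2} (y - \mathcal{X}\beta)^{\top} (y - \mathcal{X}\beta)$$
$$= -\frac{n}{2} \log(2\pi) - n \log \sigma - \frac{1}{2\sigma^2} (y^{\top} y + \beta^{\top} \mathcal{X}^{\top} \mathcal{X} \beta - 2 \beta^{\top} \mathcal{X}^{\top} y).$$

Differentiating w.r.t. the parameters yields

$$\frac{\partial}{\partial \beta} \ell = -\frac{1}{2\sigma^2} (2 \mathcal{X}^{\top} \mathcal{X} \beta - 2 \mathcal{X}^{\top} y)$$
$$\frac{\partial}{\partial \sigma} \ell = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \{(y - \mathcal{X}\beta)^{\top} (y - \mathcal{X}\beta)\}.$$

Note that $\frac{\partial}{\partial \beta} \ell$ denotes the vector of the derivatives w.r.t. all components of $\beta$ (the gradient).

Since the first equation only depends on $\beta$, we start with deriving $\hat\beta$:

$$\mathcal{X}^{\top} \mathcal{X} \hat\beta = \mathcal{X}^{\top} y \implies \hat\beta = (\mathcal{X}^{\top} \mathcal{X})^{-1} \mathcal{X}^{\top} y.$$

Plugging $\hat\beta$ into the second equation gives

$$\frac{n}{\hat\sigma} = \frac{1}{\hat\sigma^3} (y - \mathcal{X}\hat\beta)^{\top} (y - \mathcal{X}\hat\beta) \implies \hat\sigma^2 = \frac{1}{n} \|y - \mathcal{X}\hat\beta\|^2,$$

where $\|\cdot\|^2$ denotes the squared Euclidean vector norm from Section 2.6. We see that the MLE $\hat\beta$
is identical with the least squares estimator (3.52). The variance estimator

$$\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (y_i - \hat\beta^{\top} x_i)^2$$

is nothing else than the residual sum of squares (RSS) from (3.87), divided by $n$ and generalized
to the case of multivariate $x_i$.

Note that when the $x_i$ are considered to be fixed we have

$$E(y) = \mathcal{X}\beta \quad \text{and} \quad \mathrm{Var}(y) = \sigma^2 \mathcal{I}_n.$$

Then, using the properties of moments from Section 4.2 we have

$$E(\hat\beta) = (\mathcal{X}^{\top} \mathcal{X})^{-1} \mathcal{X}^{\top} E(y) = \beta, \qquad (6.7)$$

$$\mathrm{Var}(\hat\beta) = \sigma^2 (\mathcal{X}^{\top} \mathcal{X})^{-1}. \qquad (6.8)$$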
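
The formulas of Example 6.3 can be tried out on a small case. The sketch below assumes a design with an intercept and a single regressor, so that the normal equations reduce to a 2 x 2 system that can be solved by hand; the function name `ols_mle` and the data are hypothetical.

```python
def ols_mle(x, y):
    """MLE for y_i = b0 + b1*x_i + eps_i with N(0, sigma^2) errors.

    Solves the normal equations X'X b = X'y for the 2x2 case and returns
    ((b0, b1), sigma2_hat), where sigma2_hat = RSS / n.
    """
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(u * v for u, v in zip(x, y))
    det = n * sxx - sx * sx               # determinant of X'X
    b0 = (sxx * sy - sx * sxy) / det      # first row of (X'X)^{-1} X'y
    b1 = (n * sxy - sx * sy) / det        # second row of (X'X)^{-1} X'y
    rss = sum((v - b0 - b1 * u) ** 2 for u, v in zip(x, y))
    return (b0, b1), rss / n              # the MLE of sigma^2 divides by n


x = [0.0, 1.0, 2.0, 3.0]                  # made-up regressor values
y = [1.5, 2.5, 5.5, 6.5]                  # made-up responses
(b0_hat, b1_hat), sigma2_hat = ols_mle(x, y)
```

Dividing the residual sum of squares by n rather than by n - p is what distinguishes the MLE of the error variance from the usual unbiased estimator.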


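Summary

↪ If $\{x_i\}_{i=1}^n$ is an i.i.d. sample from a distribution with pdf $f(x; \theta)$, then
$L(\mathcal{X}; \theta) = \prod_{i=1}^n f(x_i; \theta)$ is the likelihood function. The maximum likelihood
estimator (MLE) is that value of $\theta$ which maximizes $L(\mathcal{X}; \theta)$. Equivalently
one can maximize the log-likelihood $\ell(\mathcal{X}; \theta)$.

↪ The MLE's of $\mu$ and $\Sigma$ from a $N_p(\mu, \Sigma)$ distribution are $\hat\mu = \bar{x}$ and $\hat\Sigma = \mathcal{S}$.
Note that the MLE of $\Sigma$ is not unbiased.

↪ The MLE's of $\beta$ and $\sigma$ in the linear model $y = \mathcal{X}\beta + \varepsilon$, $\varepsilon \sim N_n(0, \sigma^2 \mathcal{I})$
are given by the least squares estimator $\hat\beta = (\mathcal{X}^{\top} \mathcal{X})^{-1} \mathcal{X}^{\top} y$ and
$\hat\sigma^2 = \frac{1}{n} \|y - \mathcal{X}\hat\beta\|^2$. $E(\hat\beta) = \beta$ and $\mathrm{Var}(\hat\beta) = \sigma^2 (\mathcal{X}^{\top} \mathcal{X})^{-1}$.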
6.2 The Cramer-Rao Lower Bound


As pointed out above, an important question in estimation theory is whether an estimator
$\hat\theta$ has certain desired properties, in particular, if it converges to the unknown parameter $\theta$
it is supposed to estimate. One typical property we want for an estimator is unbiasedness,
meaning that on the average, the estimator hits its target: $E(\hat\theta) = \theta$. We have seen for
instance (see Example 6.2) that $\bar{x}$ is an unbiased estimator of $\mu$ and $\mathcal{S}$ is a biased estimator
of $\Sigma$ in finite samples. If we restrict ourselves to unbiased estimation then the natural
question is whether the estimator shares some optimality properties in terms of its sampling
variance. Since we focus on unbiasedness, we look for an estimator with the smallest possible
variance.


In this context, the Cramer-Rao lower bound will give the minimal achievable variance 
for any unbiased estimator. This result is valid under very general regularity conditions 
(discussed below). One of the most important applications of the Cramer-Rao lower bound 
is that it provides the asymptotic optimality property of maximum likelihood estimators. 
The Cramer-Rao theorem involves the score function and its properties which will be derived 
first. 


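The score function $s(\mathcal{X}; \theta)$ is the derivative of the log likelihood function w.r.t. $\theta \in \mathbb{R}^k$

$$s(\mathcal{X}; \theta) = \frac{\partial}{\partial \theta} \ell(\mathcal{X}; \theta) = \frac{1}{L(\mathcal{X}; \theta)} \frac{\partial}{\partial \theta} L(\mathcal{X}; \theta). \qquad (6.9)$$

The covariance matrix $\mathcal{F}_n = \mathrm{Var}\{s(\mathcal{X}; \theta)\}$ is called the Fisher information matrix. In what
follows, we will give some interesting properties of score functions.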


THEOREM 6.1 If $s = s(\mathcal{X}; \theta)$ is the score function and if $\hat\theta = t = t(\mathcal{X}, \theta)$ is any function
of $\mathcal{X}$ and $\theta$, then under regularity conditions

$$E(s t^{\top}) = \frac{\partial}{\partial \theta} E(t^{\top}) - E\left( \frac{\partial t^{\top}}{\partial \theta} \right). \qquad (6.10)$$

The proof is left as an exercise (see Exercise 6.9). The regularity conditions required for this
theorem are rather technical and ensure that the expressions (expectations and derivatives)
appearing in (6.10) are well defined. In particular, the support of the density $f(x; \theta)$ should
not depend on $\theta$. The next corollary is a direct consequence.

COROLLARY 6.1 If $s = s(\mathcal{X}; \theta)$ is the score function, and $\hat\theta = t = t(\mathcal{X})$ is any unbiased
estimator of $\theta$ (i.e., $E(t) = \theta$), then

$$E(s t^{\top}) = \mathrm{Cov}(s, t) = \mathcal{I}_k. \qquad (6.11)$$

Note that the score function has mean zero (see Exercise 6.10):

$$E\{s(\mathcal{X}; \theta)\} = 0. \qquad (6.12)$$

Hence, $E(s s^{\top}) = \mathrm{Var}(s) = \mathcal{F}_n$ and by setting $s = t$ in Theorem 6.1 it follows that

$$\mathcal{F}_n = -E\left\{ \frac{\partial^2}{\partial \theta \, \partial \theta^{\top}} \ell(\mathcal{X}; \theta) \right\}.$$

REMARK 6.1 If $x_1, \ldots, x_n$ are i.i.d., $\mathcal{F}_n = n \mathcal{F}_1$ where $\mathcal{F}_1$ is the Fisher information matrix
for sample size $n = 1$.


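EXAMPLE 6.4 Consider an i.i.d. sample $\{x_i\}_{i=1}^n$ from $N_p(\theta, \mathcal{I})$. In this case the parameter
$\theta$ is the mean $\mu$. It follows from (6.3) that:

$$s(\mathcal{X}; \theta) = \frac{\partial}{\partial \theta} \ell(\mathcal{X}; \theta) = -\frac{1}{2} \frac{\partial}{\partial \theta} \left\{ \sum_{i=1}^n (x_i - \theta)^{\top} (x_i - \theta) \right\} = n (\bar{x} - \theta).$$

Hence, the information matrix is

$$\mathcal{F}_n = \mathrm{Var}\{n (\bar{x} - \theta)\} = n \mathcal{I}_p.$$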
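
The two properties used in Example 6.4 (the score has mean zero, and its variance is the Fisher information n for p = 1) can be checked by a small Monte Carlo experiment. The sketch below is an illustration only: the parameter value, sample size, number of replications and seed are arbitrary assumptions.

```python
import random


def score_normal_mean(sample, theta):
    """Score of an N(theta, 1) sample: s = n*(xbar - theta) = sum(x_i - theta)."""
    return sum(x - theta for x in sample)


def simulate_scores(theta, n, reps, seed=0):
    """Draw `reps` samples of size n from N(theta, 1) and return their scores."""
    rng = random.Random(seed)
    return [score_normal_mean([rng.gauss(theta, 1.0) for _ in range(n)], theta)
            for _ in range(reps)]


scores = simulate_scores(theta=2.0, n=5, reps=20000)
mean_score = sum(scores) / len(scores)                              # close to 0
var_score = sum((s - mean_score) ** 2 for s in scores) / len(scores)  # close to n = 5
```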


How well can we estimate $\theta$? The answer is given in the following theorem which is due
to Cramer and Rao. As pointed out above, this theorem gives a lower bound for unbiased 
estimators. Hence, all estimators, which are unbiased and attain this lower bound, are 
minimum variance estimators. 


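THEOREM 6.2 (Cramer-Rao) If $\hat\theta = t = t(\mathcal{X})$ is any unbiased estimator for $\theta$, then under
regularity conditions

$$\mathrm{Var}(t) \ge \mathcal{F}_n^{-1}, \qquad (6.13)$$

where

$$\mathcal{F}_n = E\{s(\mathcal{X}; \theta) s(\mathcal{X}; \theta)^{\top}\} = \mathrm{Var}\{s(\mathcal{X}; \theta)\} \qquad (6.14)$$

is the Fisher information matrix.

Proof:
Consider the correlation $\rho_{Y,Z}$ between $Y$ and $Z$ where $Y = a^{\top} t$, $Z = c^{\top} s$. Here $s$ is the score
and the vectors $a, c \in \mathbb{R}^k$. By Corollary 6.1 $\mathrm{Cov}(s, t) = \mathcal{I}_k$ and thus

$$\mathrm{Cov}(Y, Z) = a^{\top} \mathrm{Cov}(t, s) c = a^{\top} c$$
$$\mathrm{Var}(Z) = c^{\top} \mathrm{Var}(s) c = c^{\top} \mathcal{F}_n c.$$

Hence,

$$\rho_{Y,Z}^2 = \frac{\mathrm{Cov}^2(Y, Z)}{\mathrm{Var}(Y) \mathrm{Var}(Z)} = \frac{(a^{\top} c)^2}{a^{\top} \mathrm{Var}(t) a \cdot c^{\top} \mathcal{F}_n c} \le 1. \qquad (6.15)$$

In particular, this holds for any $c \ne 0$. Therefore it holds also for the maximum of the
left-hand side of (6.15) with respect to $c$. Since

$$\max_{c} \frac{c^{\top} a a^{\top} c}{c^{\top} \mathcal{F}_n c} = \max_{c^{\top} \mathcal{F}_n c = 1} c^{\top} a a^{\top} c$$

and

$$\max_{c^{\top} \mathcal{F}_n c = 1} c^{\top} a a^{\top} c = a^{\top} \mathcal{F}_n^{-1} a$$

by our maximization Theorem 2.5 we have

$$\frac{a^{\top} \mathcal{F}_n^{-1} a}{a^{\top} \mathrm{Var}(t) a} \le 1 \qquad \forall\, a \in \mathbb{R}^k, \ a \ne 0,$$

i.e.,

$$a^{\top} \{\mathrm{Var}(t) - \mathcal{F}_n^{-1}\} a \ge 0 \qquad \forall\, a \in \mathbb{R}^k, \ a \ne 0,$$

which is equivalent to $\mathrm{Var}(t) \ge \mathcal{F}_n^{-1}$. □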


Maximum likelihood estimators (MLE's) attain the lower bound if the sample size $n$ goes to
infinity. The next Theorem 6.3 states this and, in addition, gives the asymptotic sampling
distribution of the maximum likelihood estimator, which turns out to be multinormal.

THEOREM 6.3 Suppose that the sample $\{x_i\}_{i=1}^n$ is i.i.d. If $\hat\theta$ is the MLE for $\theta \in \mathbb{R}^k$, i.e.,
$\hat\theta = \arg\max_{\theta} L(\mathcal{X}; \theta)$, then under some regularity conditions, as $n \to \infty$:

$$\sqrt{n} (\hat\theta - \theta) \xrightarrow{\mathcal{L}} N_k(0, \mathcal{F}_1^{-1}) \qquad (6.16)$$

where $\mathcal{F}_1$ denotes the Fisher information for sample size $n = 1$.

As a consequence of Theorem 6.3 we see that under regularity conditions the MLE is
asymptotically unbiased, efficient (minimum variance) and normally distributed. Also it is a
consistent estimator of $\theta$.

Note that from property (5.4) of the multinormal it follows that asymptotically

$$n (\hat\theta - \theta)^{\top} \mathcal{F}_1 (\hat\theta - \theta) \xrightarrow{\mathcal{L}} \chi^2_k. \qquad (6.17)$$

If $\hat{\mathcal{F}}_1$ is a consistent estimator of $\mathcal{F}_1$ (e.g. $\hat{\mathcal{F}}_1 = \mathcal{F}_1(\hat\theta)$), we have equivalently

$$n (\hat\theta - \theta)^{\top} \hat{\mathcal{F}}_1 (\hat\theta - \theta) \xrightarrow{\mathcal{L}} \chi^2_k. \qquad (6.18)$$

This expression is sometimes useful in testing hypotheses about $\theta$ and in constructing
confidence regions for $\theta$ in a very general setup. These issues will be raised in more detail in
the next chapter but from (6.18) it can be seen, for instance, that when $n$ is large,

$$P\left( n (\hat\theta - \theta)^{\top} \hat{\mathcal{F}}_1 (\hat\theta - \theta) \le \chi^2_{1-\alpha;k} \right) \approx 1 - \alpha,$$

where $\chi^2_{\nu;p}$ denotes the $\nu$-quantile of a $\chi^2_p$ random variable. So, the ellipsoid
$n (\hat\theta - \theta)^{\top} \hat{\mathcal{F}}_1 (\hat\theta - \theta) \le \chi^2_{1-\alpha;k}$ provides in $\mathbb{R}^k$ an asymptotic $(1 - \alpha)$-confidence region for $\theta$.
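
A minimal sketch of how such an asymptotic confidence region could be checked in practice is given below. It assumes the special case k = 2 with Fisher information equal to the identity matrix (as for the mean of a bivariate normal with identity covariance), because the chi-square quantile with 2 degrees of freedom then has a closed form. The function names and numbers are illustrative assumptions.

```python
import math


def chi2_quantile_2df(level):
    """Quantile of the chi-square distribution with 2 degrees of freedom.

    Its cdf is 1 - exp(-x/2), so the quantile is -2*log(1 - level).
    """
    return -2.0 * math.log(1.0 - level)


def in_confidence_region(theta, theta_hat, n, level=0.95):
    """Check n*(theta_hat - theta)' F_1 (theta_hat - theta) <= quantile,
    under the assumption F_1 = I_2."""
    stat = n * sum((h - t) ** 2 for h, t in zip(theta_hat, theta))
    return stat <= chi2_quantile_2df(level)


inside = in_confidence_region((0.0, 0.0), (0.1, 0.1), n=100)   # statistic 2.0
outside = in_confidence_region((0.0, 0.0), (0.3, 0.0), n=100)  # statistic 9.0
```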


Summary

↪ The score function is the derivative $s(\mathcal{X}; \theta) = \frac{\partial}{\partial \theta} \ell(\mathcal{X}; \theta)$ of the
log-likelihood with respect to $\theta$. The covariance matrix of $s(\mathcal{X}; \theta)$ is the Fisher
information matrix.

↪ The score function has mean zero: $E\{s(\mathcal{X}; \theta)\} = 0$.

↪ The Cramer-Rao bound says that any unbiased estimator $\hat\theta = t = t(\mathcal{X})$
has a variance that is bounded from below by the inverse of the Fisher
information. Thus, an unbiased estimator, which attains this lower bound,
is a minimum variance estimator.

↪ For i.i.d. data $\{x_i\}_{i=1}^n$ the Fisher information matrix is: $\mathcal{F}_n = n \mathcal{F}_1$.

↪ MLE's attain the lower bound in an asymptotic sense, i.e.,

$$\sqrt{n} (\hat\theta - \theta) \xrightarrow{\mathcal{L}} N_k(0, \mathcal{F}_1^{-1})$$

if $\hat\theta$ is the MLE for $\theta \in \mathbb{R}^k$, i.e., $\hat\theta = \arg\max_{\theta} L(\mathcal{X}; \theta)$.


6.3. Exercises 


EXERCISE 6.1 Consider a uniform distribution on the interval $[0, \theta]$. What is the MLE
of $\theta$? (Hint: the maximization here cannot be performed by means of derivatives. Here the
support of $x$ depends on $\theta$!)

EXERCISE 6.2 Consider an i.i.d. sample of size $n$ from the bivariate population with pdf
$f(x_1, x_2) = \frac{1}{\theta_1 \theta_2} e^{-\left( \frac{x_1}{\theta_1} + \frac{x_2}{\theta_2} \right)}$, $x_1, x_2 > 0$. Compute the MLE of $\theta = (\theta_1, \theta_2)$. Find the
Cramer-Rao lower bound. Is it possible to derive a minimal variance unbiased estimator of $\theta$?

EXERCISE 6.3 Show that the MLE of Example 6.1, $\hat\mu = \bar{x}$, is a minimal variance estimator
for any finite sample size $n$ (i.e., without applying Theorem 6.3).

EXERCISE 6.4 We know from Example 6.4 that the MLE of Example 6.1 has $\mathcal{F}_1 = \mathcal{I}_p$.
This leads to

$$\sqrt{n} (\bar{x} - \mu) \xrightarrow{\mathcal{L}} N_p(0, \mathcal{I})$$

by Theorem 6.3. Can you give an analogous result for the square $\bar{x}^2$ for the case $p = 1$?


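EXERCISE 6.5 Consider an i.i.d. sample of size $n$ from the bivariate population with pdf
$f(x_1, x_2) = \frac{1}{\theta_1^2 \theta_2} \frac{1}{x_2} e^{-\left( \frac{x_1}{\theta_1 x_2} + \frac{x_2}{\theta_1 \theta_2} \right)}$, $x_1, x_2 > 0$. Compute the MLE of $\theta = (\theta_1, \theta_2)$. Find the
Cramer-Rao lower bound and the asymptotic variance of $\hat\theta$.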


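EXERCISE 6.6 Consider a sample $\{x_i\}_{i=1}^n$ from $N_p(\mu, \Sigma_0)$ where $\Sigma_0$ is known. Compute
the Cramer-Rao lower bound for $\mu$. Can you derive a minimal variance unbiased estimator for $\mu$?

EXERCISE 6.7 Let $X \sim N_p(\mu, \Sigma)$ where $\Sigma$ is unknown but we know
$\Sigma = \mathrm{diag}(\sigma_{11}, \sigma_{22}, \ldots, \sigma_{pp})$. From an i.i.d. sample of size $n$, find the MLE of $\mu$ and of $\Sigma$.

EXERCISE 6.8 Reconsider the setup of the previous exercise. Suppose that

$$\Sigma = \mathrm{diag}(\sigma_{11}, \sigma_{22}, \ldots, \sigma_{pp}).$$

Can you derive in this case the Cramer-Rao lower bound for $\theta^{\top} = (\mu_1 \ldots \mu_p, \sigma_{11} \ldots \sigma_{pp})$?

EXERCISE 6.9 Prove Theorem 6.1. Hint: start from $\frac{\partial}{\partial \theta} E(t^{\top}) = \frac{\partial}{\partial \theta} \int t^{\top}(\mathcal{X}, \theta) L(\mathcal{X}; \theta) \, d\mathcal{X}$,
then permute integral and derivatives and note that $s(\mathcal{X}; \theta) = \frac{1}{L(\mathcal{X}; \theta)} \frac{\partial}{\partial \theta} L(\mathcal{X}; \theta)$.

EXERCISE 6.10 Prove expression (6.12).
(Hint: start from $E\{s(\mathcal{X}; \theta)\} = \int \frac{1}{L(\mathcal{X}; \theta)} \frac{\partial}{\partial \theta} L(\mathcal{X}; \theta) \, L(\mathcal{X}; \theta) \, d\mathcal{X}$ and then permute integral and
derivative.)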


7 Hypothesis Testing


In the preceding chapter, the theoretical basis of estimation theory was presented. Now we
turn our interest towards testing issues: we want to test the hypothesis $H_0$ that the unknown
parameter $\theta$ belongs to some subspace of $\mathbb{R}^q$. This subspace is called the null set and will
be denoted by $\Omega_0 \subset \mathbb{R}^q$.

In many cases, this null set corresponds to restrictions which are imposed on the parameter
space: $H_0$ corresponds to a "reduced model". As we have already seen in Chapter 3, the
solution to a testing problem is in terms of a rejection region $R$ which is a set of values in
the sample space which leads to the decision of rejecting the null hypothesis $H_0$ in favor of
an alternative $H_1$, which is called the "full model".

In general, we want to construct a rejection region $R$ which controls the size of the type I
error, i.e. the probability of rejecting the null hypothesis when it is true. More formally, a
solution to a testing problem is of predetermined size $\alpha$ if:

$$P(\text{Rejecting } H_0 \mid H_0 \text{ is true}) = \alpha.$$

In fact, since $H_0$ is often a composite hypothesis, it is achieved by finding $R$ such that

$$\sup_{\theta \in \Omega_0} P(\mathcal{X} \in R \mid \theta) = \alpha.$$

In this chapter we will introduce a tool which allows us to build a rejection region in general
situations: it is based on the likelihood ratio principle. This is a very useful technique
because it allows us to derive a rejection region with an asymptotically appropriate size
$\alpha$. The technique will be illustrated through various testing problems and examples. We
concentrate on multinormal populations and linear models where the size of the test will
often be exact even for finite sample sizes $n$.


Section 7.1 gives the basic ideas and Section 7.2 presents the general problem of testing linear 
restrictions. This allows us to propose solutions to frequent types of analyses (including 
comparisons of several means, repeated measurements and profile analysis). Each case can 
be viewed as a simple specific case of testing linear restrictions. Special attention is devoted 
to confidence intervals and confidence regions for means and for linear restrictions on means 
in a multinormal setup. 
7.1 Likelihood Ratio Test


Suppose that the distribution of $\{x_i\}_{i=1}^n$, $x_i \in \mathbb{R}^p$, depends on a parameter vector $\theta$. We will
consider two hypotheses:

$$H_0: \theta \in \Omega_0$$
$$H_1: \theta \in \Omega_1.$$

The hypothesis $H_0$ corresponds to the "reduced model" and $H_1$ to the "full model". This
notation was already used in Chapter 3.

EXAMPLE 7.1 Consider a multinormal $N_p(\theta, \mathcal{I})$. To test if $\theta$ equals a certain fixed value
$\theta_0$ we construct the test problem:

$$H_0: \theta = \theta_0$$
$$H_1: \text{no constraints on } \theta$$

or, equivalently, $\Omega_0 = \{\theta_0\}$, $\Omega_1 = \mathbb{R}^p$.

Define $L_j^* = \max_{\theta \in \Omega_j} L(\mathcal{X}; \theta)$, the maxima of the likelihood for each of the hypotheses. Consider
the likelihood ratio (LR)

$$\lambda(\mathcal{X}) = \frac{L_0^*}{L_1^*}. \qquad (7.1)$$

One tends to favor $H_0$ if the LR is high and $H_1$ if the LR is low. The likelihood ratio test
(LRT) tells us when exactly to favor $H_0$ over $H_1$. A likelihood ratio test of size $\alpha$ for testing
$H_0$ against $H_1$ has the rejection region

$$R = \{\mathcal{X} : \lambda(\mathcal{X}) < c\}$$

where $c$ is determined so that $\sup_{\theta \in \Omega_0} P_{\theta}(\mathcal{X} \in R) = \alpha$. The difficulty here is to express $c$ as a
function of $\alpha$, because $\lambda(\mathcal{X})$ might be a complicated function of $\mathcal{X}$.

Instead of $\lambda$ we may equivalently use the log-likelihood ratio

$$-2 \log \lambda = 2 (\ell_1^* - \ell_0^*).$$

In this case the rejection region will be $R = \{\mathcal{X} : -2 \log \lambda(\mathcal{X}) > k\}$. What is the distribution
of $\lambda$ or of $-2 \log \lambda$ from which we need to compute $c$ or $k$?
THEOREM 7.1 If $\Omega_1 \subset \mathbb{R}^q$ is a $q$-dimensional space and if $\Omega_0 \subset \Omega_1$ is an $r$-dimensional
subspace, then under regularity conditions

$$\forall\, \theta \in \Omega_0: \quad -2 \log \lambda \xrightarrow{\mathcal{L}} \chi^2_{q-r} \quad \text{as } n \to \infty.$$

An asymptotic rejection region can now be given by simply computing the $1 - \alpha$ quantile
$k = \chi^2_{1-\alpha;q-r}$. The LRT rejection region is therefore

$$R = \{\mathcal{X} : -2 \log \lambda(\mathcal{X}) > \chi^2_{1-\alpha;q-r}\}.$$

Theorem 7.1 is thus very helpful: it gives a general way of building rejection regions in
many problems. Unfortunately, it is only an asymptotic result, meaning that the size of the
test is only approximately equal to $\alpha$, although the approximation becomes better when the
sample size $n$ increases. The question is "how large should $n$ be?". There is no definite rule:
we encounter here the same problem that was already discussed with respect to the Central
Limit Theorem in Chapter 4.

Fortunately, in many standard circumstances, we can derive exact tests even for finite
samples because the test statistic $-2 \log \lambda(\mathcal{X})$ or a simple transformation of it turns out to
have a simple form. This is the case in most of the following standard testing problems. All
of them can be viewed as an illustration of the likelihood ratio principle.


Test Problem 1 is an amuse-bouche: in testing the mean of a multinormal population with a
known covariance matrix the likelihood ratio statistic has a very simple quadratic form with
a known distribution under $H_0$.

TEST PROBLEM 1 Suppose that $X_1, \ldots, X_n$ is an i.i.d. random sample from a $N_p(\mu, \Sigma)$
population.
$H_0: \mu = \mu_0$, $\Sigma$ known versus $H_1$: no constraints.

In this case $H_0$ is a simple hypothesis, i.e., $\Omega_0 = \{\mu_0\}$ and therefore the dimension $r$ of $\Omega_0$
equals 0. Since we have imposed no constraints in $H_1$, the space $\Omega_1$ is the whole $\mathbb{R}^p$ which
leads to $q = p$. From (6.6) we know that

$$\ell_0^* = \ell(\mu_0, \Sigma) = -\frac{n}{2} \log |2\pi\Sigma| - \frac{1}{2} n \, \mathrm{tr}(\Sigma^{-1} \mathcal{S}) - \frac{1}{2} n (\bar{x} - \mu_0)^{\top} \Sigma^{-1} (\bar{x} - \mu_0).$$

Under $H_1$ the maximum of $\ell(\mu, \Sigma)$ is

$$\ell_1^* = \ell(\bar{x}, \Sigma) = -\frac{n}{2} \log |2\pi\Sigma| - \frac{1}{2} n \, \mathrm{tr}(\Sigma^{-1} \mathcal{S}).$$

Therefore,

$$-2 \log \lambda = 2 (\ell_1^* - \ell_0^*) = n (\bar{x} - \mu_0)^{\top} \Sigma^{-1} (\bar{x} - \mu_0) \qquad (7.2)$$

which, by Theorem 4.7, has a $\chi^2_p$-distribution under $H_0$.
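
The statistic (7.2) is easy to compute directly. The following sketch does so for the bivariate case, inverting the 2 x 2 covariance matrix by hand; the sample mean, hypothesized mean, covariance matrix and sample size are invented numbers used only for illustration, and the critical value relies on the closed-form quantile of the chi-square distribution with 2 degrees of freedom.

```python
import math


def lr_stat_known_cov(xbar, mu0, sigma, n):
    """-2 log lambda = n (xbar - mu0)' Sigma^{-1} (xbar - mu0), as in (7.2),
    for a known 2x2 covariance matrix Sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    d1, d2 = xbar[0] - mu0[0], xbar[1] - mu0[1]
    # Quadratic form using Sigma^{-1} = [[d, -b], [-c, a]] / det
    quad = (d * d1 * d1 - (b + c) * d1 * d2 + a * d2 * d2) / det
    return n * quad


stat = lr_stat_known_cov(xbar=(1.2, 0.9), mu0=(1.0, 1.0),
                         sigma=((1.0, 0.5), (0.5, 1.0)), n=50)
critical = -2.0 * math.log(0.05)   # 0.95-quantile of chi-square with 2 d.f.
reject = stat > critical
```

In this made-up case the statistic stays below the critical value, so the hypothesized mean would not be rejected at the 5% level.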


EXAMPLE 7.2 Consider the bank data again. Let us test whether the population mean of
the forged bank notes is equal to

$$\mu_0 = (214.9, 129.9, 129.7, 8.3, 10.1, 141.5)^{\top}.$$

(This is in fact the sample mean of the genuine bank notes.) The sample mean of the forged
bank notes is

$$\bar{x} = (214.8, 130.3, 130.2, 10.5, 11.1, 139.4)^{\top}.$$

Suppose for the moment that the estimated covariance matrix $\mathcal{S}_f$ given in (3.5) is the true
covariance matrix $\Sigma$. We construct the likelihood ratio test and obtain

$$-2 \log \lambda = 2 (\ell_1^* - \ell_0^*) = n (\bar{x} - \mu_0)^{\top} \Sigma^{-1} (\bar{x} - \mu_0) = 7362.32,$$

the quantile $k = \chi^2_{0.95;6}$ equals 12.592. The rejection region consists of all values in the
sample space which lead to values of the likelihood ratio test statistic larger than 12.592.
Under $H_0$ the value of $-2 \log \lambda$ is therefore highly significant. Hence, the true mean of the
forged bank notes is significantly different from $\mu_0$!


Test Problem 2 is the same as the preceding one but in a more realistic situation where
the covariance matrix is unknown: here the Hotelling's $T^2$-distribution will be useful to
determine an exact test and a confidence region for the unknown $\mu$.

TEST PROBLEM 2 Suppose that $X_1, \ldots, X_n$ is an i.i.d. random sample from a $N_p(\mu, \Sigma)$
population.
$H_0: \mu = \mu_0$, $\Sigma$ unknown versus $H_1$: no constraints.

Under $H_0$ it can be shown that

$$\ell_0^* = \ell(\mu_0, \mathcal{S} + d d^{\top}), \qquad d = (\bar{x} - \mu_0) \qquad (7.3)$$

and under $H_1$ we have

$$\ell_1^* = \ell(\bar{x}, \mathcal{S}).$$

This leads after some calculation to

$$-2 \log \lambda = 2 (\ell_1^* - \ell_0^*) = n \log(1 + d^{\top} \mathcal{S}^{-1} d). \qquad (7.4)$$

This statistic is a monotone function of $(n - 1) d^{\top} \mathcal{S}^{-1} d$. This means that $-2 \log \lambda > k$ if and
only if $(n - 1) d^{\top} \mathcal{S}^{-1} d > k'$. The latter statistic has by Corollary 5.3, under $H_0$, a Hotelling's
$T^2$-distribution. Therefore,

$$(n - 1)(\bar{x} - \mu_0)^{\top} \mathcal{S}^{-1} (\bar{x} - \mu_0) \sim T^2(p, n - 1), \qquad (7.5)$$

or equivalently

$$\left( \frac{n - p}{p} \right) (\bar{x} - \mu_0)^{\top} \mathcal{S}^{-1} (\bar{x} - \mu_0) \sim F_{p,n-p}. \qquad (7.6)$$

In this case an exact rejection region may be defined as

$$\left( \frac{n - p}{p} \right) (\bar{x} - \mu_0)^{\top} \mathcal{S}^{-1} (\bar{x} - \mu_0) > F_{1-\alpha;p,n-p}.$$

Alternatively, we have from Theorem 7.1 that under $H_0$ the asymptotic distribution of the
test statistic is

$$-2 \log \lambda \xrightarrow{\mathcal{L}} \chi^2_p, \quad \text{as } n \to \infty$$

which leads to the (asymptotically valid) rejection region

$$n \log\{1 + (\bar{x} - \mu_0)^{\top} \mathcal{S}^{-1} (\bar{x} - \mu_0)\} > \chi^2_{1-\alpha;p},$$

but of course, in this case, we would prefer to use the exact F-test provided just above.

EXAMPLE 7.3 Consider the problem of Example 7.2 again. We know that $\mathcal{S}_f$ is the empirical
analogue for $\Sigma_f$, the covariance matrix for the forged banknotes. The test statistic (7.5)
has the value 1153.4 or its equivalent for the $F$ distribution in (7.6) is 182.5 which is highly
significant ($F_{0.95;6,94} = 2.1966$) so that we conclude that $\mu_f \ne \mu_0$.
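
The link between (7.4), (7.5) and (7.6) can be retraced with the numbers of Example 7.3. The sketch below assumes n = 100 and p = 6, which is inferred from the degrees of freedom of the quoted quantile $F_{0.95;6,94}$; the function names are illustrative.

```python
import math


def hotelling_to_f(t2, n, p):
    """Convert T^2 = (n-1) d'S^{-1}d from (7.5) into the F statistic of (7.6)."""
    return (n - p) / (p * (n - 1)) * t2


def hotelling_to_lr(t2, n):
    """Convert T^2 into -2 log lambda = n*log(1 + d'S^{-1}d) from (7.4)."""
    return n * math.log(1.0 + t2 / (n - 1))


n, p = 100, 6                      # inferred from F_{0.95;6,94}
f_stat = hotelling_to_f(1153.4, n, p)
lr_stat = hotelling_to_lr(1153.4, n)
reject_exact = f_stat > 2.1966     # quantile quoted in Example 7.3
reject_asymptotic = lr_stat > 12.592   # chi-square quantile quoted in Example 7.2
```

Both the exact F-test and the asymptotic chi-square test lead to the same decision here, since all three statistics are monotone transformations of one another.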


Confidence Region for $\mu$


When estimating a multidimensional parameter $\theta \in \mathbb{R}^k$ from a sample, we saw in Chapter 6
how to determine the estimator $\hat\theta = \hat\theta(\mathcal{X})$. After the sample is observed we end up with a
point estimate, which is the corresponding observed value of $\hat\theta$. We know $\hat\theta(\mathcal{X})$ is a random
variable and we often prefer to determine a confidence region for $\theta$. A confidence region (CR)
is a random subset of $\mathbb{R}^k$ (determined by appropriate statistics) such that we are "confident",
at a certain given level $1 - \alpha$, that this region contains $\theta$:

$$P(\theta \in CR) = 1 - \alpha.$$

This is just a multidimensional generalization of the basic univariate confidence interval.
Confidence regions are particularly useful when a hypothesis $H_0$ on $\theta$ is rejected, because
they help in eventually identifying which component of $\theta$ is responsible for the rejection.

There are only a few cases where confidence regions can be easily assessed; these include most
of the testing problems on means presented in this section.


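Corollary 5.3 provides a pivotal quantity which allows confidence regions for $\mu$ to be
constructed. Since $\left( \frac{n-p}{p} \right) (\bar{x} - \mu)^{\top} \mathcal{S}^{-1} (\bar{x} - \mu) \sim F_{p,n-p}$, we have

$$P\left( \left( \frac{n-p}{p} \right) (\mu - \bar{x})^{\top} \mathcal{S}^{-1} (\mu - \bar{x}) < F_{1-\alpha;p,n-p} \right) = 1 - \alpha.$$

Then,

$$CR = \left\{ \mu \in \mathbb{R}^p \mid (\mu - \bar{x})^{\top} \mathcal{S}^{-1} (\mu - \bar{x}) \le \frac{p}{n-p} F_{1-\alpha;p,n-p} \right\}$$

is a confidence region at level $(1 - \alpha)$ for $\mu$. It is the interior of an iso-distance ellipsoid in $\mathbb{R}^p$
centered at $\bar{x}$, with a scaling matrix $\mathcal{S}^{-1}$ and a distance constant $\left( \frac{p}{n-p} \right) F_{1-\alpha;p,n-p}$. When
$p$ is large, ellipsoids are not easy to handle for practical purposes. One is thus interested
in finding confidence intervals for $\mu_1, \mu_2, \ldots, \mu_p$ so that simultaneous confidence on all the
intervals reaches the desired level of say, $1 - \alpha$.

In the following, we consider a more general problem. We construct simultaneous confidence
intervals for all possible linear combinations $a^{\top} \mu$, $a \in \mathbb{R}^p$ of the elements of $\mu$.

Suppose for a moment that we fix a particular projection vector $a$. We are back to a standard
univariate problem of finding a confidence interval for the mean $a^{\top} \mu$ of a univariate random
variable $a^{\top} X$. We can use the $t$-statistics and an obvious confidence interval for $a^{\top} \mu$ is given
by the values $a^{\top} \mu$ such that

$$\left| \frac{\sqrt{n-1} \, (a^{\top} \mu - a^{\top} \bar{x})}{\sqrt{a^{\top} \mathcal{S} a}} \right| \le t_{1-\frac{\alpha}{2};n-1}$$

or equivalently

$$t^2(a) = \frac{(n-1) \{a^{\top} (\mu - \bar{x})\}^2}{a^{\top} \mathcal{S} a} \le F_{1-\alpha;1,n-1}.$$

This provides the $(1 - \alpha)$ confidence interval for $a^{\top} \mu$:

$$a^{\top} \bar{x} - \sqrt{F_{1-\alpha;1,n-1} \frac{a^{\top} \mathcal{S} a}{n-1}} \le a^{\top} \mu \le a^{\top} \bar{x} + \sqrt{F_{1-\alpha;1,n-1} \frac{a^{\top} \mathcal{S} a}{n-1}}.$$

Now it is easy to prove (using Theorem 2.5) that:

$$\max_{a} t^2(a) = (n - 1)(\bar{x} - \mu)^{\top} \mathcal{S}^{-1} (\bar{x} - \mu) \sim T^2(p, n - 1).$$

Therefore, simultaneously for all $a \in \mathbb{R}^p$, the interval

$$\left( a^{\top} \bar{x} - \sqrt{K_{\alpha} a^{\top} \mathcal{S} a}, \ a^{\top} \bar{x} + \sqrt{K_{\alpha} a^{\top} \mathcal{S} a} \right) \qquad (7.7)$$

where $K_{\alpha} = \frac{p}{n-p} F_{1-\alpha;p,n-p}$, will contain $a^{\top} \mu$ with probability $(1 - \alpha)$.

A particular choice of $a$ are the columns of the identity matrix $\mathcal{I}_p$, providing simultaneous
confidence intervals for $\mu_1, \ldots, \mu_p$. We have therefore with probability $(1 - \alpha)$ for $j = 1, \ldots, p$

$$\bar{x}_j - \sqrt{\frac{p}{n-p} F_{1-\alpha;p,n-p} \, s_{jj}} \le \mu_j \le \bar{x}_j + \sqrt{\frac{p}{n-p} F_{1-\alpha;p,n-p} \, s_{jj}}. \qquad (7.8)$$

It should be noted that these intervals define a rectangle enclosing the confidence ellipsoid for
$\mu$ given above. They are particularly useful when a null hypothesis $H_0$ of the type described
above is rejected and one would like to see which component(s) are mainly responsible for
the rejection.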


EXAMPLE 7.4 The 95% confidence region for $\mu_f$, the mean of the forged banknotes, is
given by the ellipsoid:

$$\left\{ \mu \in \mathbb{R}^6 \mid (\mu - \bar{x}_f)^{\top} \mathcal{S}_f^{-1} (\mu - \bar{x}_f) \le \frac{6}{94} F_{0.95;6,94} \right\}.$$

The 95% simultaneous confidence intervals are given by (we use $F_{0.95;6,94} = 2.1966$)

214.692 ≤ $\mu_1$ ≤ 214.954
130.205 ≤ $\mu_2$ ≤ 130.395
130.082 ≤ $\mu_3$ ≤ 130.304
10.108 ≤ $\mu_4$ ≤ 10.952
10.896 ≤ $\mu_5$ ≤ 11.370
139.242 ≤ $\mu_6$ ≤ 139.658.

Comparing the inequalities with $\mu_0 = (214.9, 129.9, 129.7, 8.3, 10.1, 141.5)^{\top}$ shows that almost
all components (except the first one) are responsible for the rejection of $\mu_0$ in Examples 7.2
and 7.3.

In addition, the method can provide other confidence intervals. We have at the same level
of confidence (choosing $a^{\top} = (0, 0, 0, 1, -1, 0)$)

$$-1.211 \le \mu_4 - \mu_5 \le 0.005$$

showing that for the forged bills, the lower border is essentially smaller than the upper border.
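
The simultaneous intervals (7.8) are simple to compute once the sample mean, the sample variance of a component and the F quantile are available. The sketch below uses n = 100, p = 6 and the quantile 2.1966 quoted in Example 7.4, while the component mean 10.0 and variance 0.25 are hypothetical placeholders, since the individual variances are not listed in this section.

```python
import math


def simultaneous_ci(xbar_j, s_jj, n, p, f_quantile):
    """Simultaneous confidence interval (7.8) for the j-th mean component.

    f_quantile is the (1 - alpha)-quantile of the F distribution with
    (p, n - p) degrees of freedom, supplied by the caller.
    """
    k_alpha = p / (n - p) * f_quantile
    half_width = math.sqrt(k_alpha * s_jj)
    return xbar_j - half_width, xbar_j + half_width


# Hypothetical component: mean 10.0 and variance 0.25 (not taken from the bank data).
lower, upper = simultaneous_ci(xbar_j=10.0, s_jj=0.25, n=100, p=6, f_quantile=2.1966)
```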


REMARK 7.1 It should be noted that the confidence region is an ellipsoid whose
characteristics depend on the whole matrix $\mathcal{S}$. In particular, the slope of the axis depends on the
eigenvectors of $\mathcal{S}$ and therefore on the covariances $s_{ij}$. However, the rectangle enclosing the
confidence ellipsoid provides the simultaneous confidence intervals for $\mu_j$, $j = 1, \ldots, p$. They
do not depend on the covariances $s_{ij}$, but only on the variances $s_{jj}$ (see (7.8)). In particular,
it may happen that a tested value $\mu_0$ is covered by the intervals (7.8) but not covered by the
confidence ellipsoid. In this case, $\mu_0$ is rejected by a test based on the confidence ellipsoid
but not rejected by a test based on the simultaneous confidence intervals. The simultaneous
confidence intervals are easier to handle than the full ellipsoid but we have lost some
information, namely the covariance between the components (see Exercise 7.14).


The following problem concerns the covariance matrix in a multinormal population: in this
situation the test statistic has a slightly more complicated distribution. We will therefore
invoke the approximation of Theorem 7.1 in order to derive a test of approximate size $\alpha$.

TEST PROBLEM 3 Suppose that $X_1, \ldots, X_n$ is an i.i.d. random sample from a $N_p(\mu, \Sigma)$
population.
$H_0: \Sigma = \Sigma_0$, $\mu$ unknown versus $H_1$: no constraints.

Under $H_0$ we have $\hat\mu = \bar{x}$, and $\Sigma = \Sigma_0$, whereas under $H_1$ we have $\hat\mu = \bar{x}$, and $\hat\Sigma = \mathcal{S}$. Hence

$$\ell_0^* = \ell(\bar{x}, \Sigma_0) = -\frac{1}{2} n \log |2\pi\Sigma_0| - \frac{1}{2} n \, \mathrm{tr}(\Sigma_0^{-1} \mathcal{S})$$
$$\ell_1^* = \ell(\bar{x}, \mathcal{S}) = -\frac{1}{2} n \log |2\pi\mathcal{S}| - \frac{1}{2} n p$$

and thus

$$-2 \log \lambda = 2 (\ell_1^* - \ell_0^*) = n \, \mathrm{tr}(\Sigma_0^{-1} \mathcal{S}) - n \log |\Sigma_0^{-1} \mathcal{S}| - n p.$$

Note that this statistic is a function of the eigenvalues of $\Sigma_0^{-1} \mathcal{S}$! Unfortunately, the exact
finite sample distribution of $-2 \log \lambda$ is very complicated. Asymptotically, we have under $H_0$

$$-2 \log \lambda \xrightarrow{\mathcal{L}} \chi^2_m \quad \text{as } n \to \infty$$

with $m = \frac{1}{2} \{p(p + 1)\}$, since a $(p \times p)$ covariance matrix has only these $m$ parameters as a
consequence of its symmetry.


EXAMPLE 7.5 Consider the US companies data set (Table B.5) and suppose we are
interested in the companies of the energy sector, analyzing their assets ($X_1$) and sales ($X_2$). The
sample is of size 15 and provides the value of

$$\mathcal{S} = 10^7 \times \begin{pmatrix} 1.6635 & 1.2410 \\ 1.2410 & 1.3747 \end{pmatrix}.$$

We want to test if

$$\mathrm{Var}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = 10^7 \times \begin{pmatrix} 1.2248 & 1.1425 \\ 1.1425 & 1.5112 \end{pmatrix} = \Sigma_0.$$

($\Sigma_0$ is in fact the empirical variance matrix for $X_1$ and $X_2$ for the manufacturing sector.)
The test statistic turns out to be $-2 \log \lambda = 2.7365$
which is not significant for $\chi^2_3$ (p-value = 0.4341). So we can not conclude that $\Sigma \ne \Sigma_0$.
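
The value of the test statistic in Example 7.5 can be recomputed from the two matrices printed there. The sketch below handles the 2 x 2 case by hand and uses the closed form of the chi-square survival function with 3 degrees of freedom; the function names are illustrative, and the common scale factor of the two matrices is omitted because it cancels in the statistic.

```python
import math


def lr_stat_cov(s, sigma0, n):
    """-2 log lambda = n tr(Sigma0^{-1} S) - n log|Sigma0^{-1} S| - n p for p = 2."""
    (s11, s12), (s21, s22) = s
    (a, b), (c, d) = sigma0
    det0 = a * d - b * c
    # trace of Sigma0^{-1} S with Sigma0^{-1} = [[d, -b], [-c, a]] / det0
    trace = (d * s11 - b * s21 - c * s12 + a * s22) / det0
    det_ratio = (s11 * s22 - s12 * s21) / det0
    return n * trace - n * math.log(det_ratio) - n * 2


def chi2_sf_3df(x):
    """P(chi^2_3 > x), using the closed form for 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


S = ((1.6635, 1.2410), (1.2410, 1.3747))        # energy sector (common scale dropped)
SIGMA0 = ((1.2248, 1.1425), (1.1425, 1.5112))   # manufacturing sector
stat = lr_stat_cov(S, SIGMA0, n=15)
p_value = chi2_sf_3df(stat)
```

Up to rounding of the printed matrix entries, this reproduces the statistic and the p-value reported in the example.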
In the next testing problem, we address a question that was already stated in Chapter 3,
Section 3.6: testing a particular value of the coefficients $\beta$ in a linear model. The presentation
is done in general terms so that it can be built on in the next section where we will test
linear restrictions on $\beta$.

TEST PROBLEM 4 Suppose that $Y_1, \ldots, Y_n$ are independent r.v.'s with
$Y_i \sim N_1(\beta^{\top} x_i, \sigma^2)$, $x_i \in \mathbb{R}^p$.

$H_0: \beta = \beta_0$, $\sigma^2$ unknown versus $H_1$: no constraints.

Under $H_0$ we have $\beta = \beta_0$, $\hat\sigma_0^2 = \frac{1}{n} \|y - \mathcal{X}\beta_0\|^2$ and under $H_1$ we have
$\hat\beta = (\mathcal{X}^{\top} \mathcal{X})^{-1} \mathcal{X}^{\top} y$, $\hat\sigma^2 = \frac{1}{n} \|y - \mathcal{X}\hat\beta\|^2$ (see Example 6.3). Hence by Theorem 7.1

$$-2 \log \lambda = 2 (\ell_1^* - \ell_0^*) = n \log\left( \frac{\|y - \mathcal{X}\beta_0\|^2}{\|y - \mathcal{X}\hat\beta\|^2} \right) \xrightarrow{\mathcal{L}} \chi^2_p.$$

We draw upon the result (3.45) which gives us:

$$F = \frac{(n - p)}{p} \left( \frac{\|y - \mathcal{X}\beta_0\|^2}{\|y - \mathcal{X}\hat\beta\|^2} - 1 \right) \sim F_{p,n-p},$$

so that in this case we again have an exact distribution.


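EXAMPLE 7.6 Let us consider our “classic blue” pullovers again. In Example 3.11 we tried to model the dependency of sales on prices. As we have seen in Figure 3.5 the slope of the regression curve is rather small, hence we might ask if $\binom{\alpha}{\beta} = \binom{211}{0}$. Here

$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_{10} \end{pmatrix} = \begin{pmatrix} x_{1,1} \\ \vdots \\ x_{10,1} \end{pmatrix}, \quad \mathcal{X} = \begin{pmatrix} 1 & x_{1,2} \\ \vdots & \vdots \\ 1 & x_{10,2} \end{pmatrix}.$$

The test statistic for the LR test is

$$-2\log\lambda = 9.10$$

which under the $\chi^2_2$ distribution is significant. The exact F-test statistic

$$F = 5.93$$

is also significant under the $F_{2,8}$ distribution ($F_{2,8;0.95} = 4.46$).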
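The two statistics of Example 7.6 are linked through the ratio of residual sums of squares, so one can be recovered from the other. A small sketch in plain Python (the helper name `f_from_lr` is hypothetical):

```python
import math

def f_from_lr(lr_stat, n, p):
    """Map -2 log(lambda) = n log(RSS0 / RSS1) to F = (n - p)/p * (RSS0/RSS1 - 1)."""
    rss_ratio = math.exp(lr_stat / n)     # ||y - X beta0||^2 / ||y - X beta_hat||^2
    return (n - p) / p * (rss_ratio - 1.0)

# Example 7.6: n = 10 observations, p = 2 coefficients, -2 log(lambda) = 9.10
f_value = f_from_lr(9.10, n=10, p=2)
print(round(f_value, 2))
```

Since 9.10 is itself a rounded value, the recovered F agrees with the reported 5.93 only to about two decimals.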
Summary

↪ The hypotheses $H_0: \theta \in \Omega_0$ against $H_1: \theta \in \Omega_1$ can be tested using the likelihood ratio test (LRT). The likelihood ratio (LR) is the quotient $\lambda(\mathcal{X}) = L_0^*/L_1^*$ where the $L_j^*$ are the maxima of the likelihood for each of the hypotheses.

↪ The test statistic in the LRT is $\lambda(\mathcal{X})$ or equivalently its logarithm $\log\lambda(\mathcal{X})$. If $\Omega_1$ is $q$-dimensional and $\Omega_0 \subset \Omega_1$ is $r$-dimensional, then the asymptotic distribution of $-2\log\lambda$ is $\chi^2_{q-r}$. This allows $H_0$ to be tested against $H_1$ by calculating the test statistic $-2\log\lambda = 2(\ell_1^* - \ell_0^*)$ where $\ell_j^* = \log L_j^*$.

↪ The hypothesis $H_0: \mu = \mu_0$ for $X \sim N_p(\mu, \Sigma)$, where $\Sigma$ is known, leads to $-2\log\lambda = n(\bar x - \mu_0)^\top \Sigma^{-1}(\bar x - \mu_0) \sim \chi^2_p$.

↪ The hypothesis $H_0: \mu = \mu_0$ for $X \sim N_p(\mu, \Sigma)$, where $\Sigma$ is unknown, leads to $-2\log\lambda = n\log\{1 + (\bar x - \mu_0)^\top S^{-1}(\bar x - \mu_0)\} \to \chi^2_p$, and $(n-1)(\bar x - \mu_0)^\top S^{-1}(\bar x - \mu_0) \sim T^2(p, n-1)$.

↪ The hypothesis $H_0: \Sigma = \Sigma_0$ for $X \sim N_p(\mu, \Sigma)$, where $\mu$ is unknown, leads to $-2\log\lambda = n\operatorname{tr}(\Sigma_0^{-1}S) - n\log|\Sigma_0^{-1}S| - np \to \chi^2_m$, $m = \frac{1}{2}p(p+1)$.

↪ The hypothesis $H_0: \beta = \beta_0$ for $Y_i \sim N_1(\beta^\top x_i, \sigma^2)$, where $\sigma^2$ is unknown, leads to $-2\log\lambda = n\log\left(\frac{\|y - \mathcal{X}\beta_0\|^2}{\|y - \mathcal{X}\hat\beta\|^2}\right) \to \chi^2_p$.


7.2 Linear Hypothesis 


In this section, we present a very general procedure which allows a linear hypothesis to be tested, i.e., a linear restriction, either on a vector mean $\mu$ or on the coefficient $\beta$ of a linear model. The presented technique covers many of the practical testing problems on means or regression coefficients.

Linear hypotheses are of the form $A\mu = a$ with known matrices $A(q \times p)$ and $a(q \times 1)$ with $q \le p$.

EXAMPLE 7.7 Let $\mu = (\mu_1, \mu_2)^\top$. The hypothesis that $\mu_1 = \mu_2$ can be equivalently written as

$$A\mu = \begin{pmatrix} 1 & -1 \end{pmatrix}\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = 0 = a.$$

The general idea is to test a normal population $H_0: A\mu = a$ (restricted model) against the full model $H_1$ where no restrictions are put on $\mu$. Due to the properties of the multinormal, we can easily adapt the Test Problems 1 and 2 to this new situation. Indeed we know, from Theorem 5.2, that $y_i = Ax_i \sim N_q(\mu_y, \Sigma_y)$, where $\mu_y = A\mu$ and $\Sigma_y = A\Sigma A^\top$.

Testing the null $H_0: A\mu = a$ is the same as testing $H_0: \mu_y = a$. The appropriate statistics are $\bar y$ and $S_y$, which can be derived from the original statistics $\bar x$ and $S$ available from $\mathcal{X}$:

$$\bar y = A\bar x, \quad S_y = ASA^\top.$$

Here the difference between the sample mean and the tested value is $d = A\bar x - a$. We are now in the situation to proceed to Test Problems 5 and 6.


TEST PROBLEM 5 Suppose $X_1, \dots, X_n$ is an i.i.d. random sample from a $N_p(\mu, \Sigma)$ population.

$H_0: A\mu = a$, $\Sigma$ known versus $H_1$: no constraints.

By (7.2) we have that, under $H_0$:

$$n(A\bar x - a)^\top (A\Sigma A^\top)^{-1}(A\bar x - a) \sim \chi^2_q,$$

and we reject $H_0$ if this test statistic is too large at the desired significance level.


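EXAMPLE 7.8 We consider hypotheses on partitioned mean vectors $\mu = \binom{\mu_1}{\mu_2}$. Let us first look at

$H_0: \mu_1 = \mu_2$, versus $H_1$: no constraints,

for $N_{2p}\left(\binom{\mu_1}{\mu_2}, \begin{pmatrix} \Sigma & 0 \\ 0 & \Sigma \end{pmatrix}\right)$ with known $\Sigma$. This is equivalent to $A = (\mathcal{I}, -\mathcal{I})$, $a = (0, \dots, 0)^\top \in \mathbb{R}^p$ and leads to:

$$-2\log\lambda = n(\bar x_1 - \bar x_2)^\top (2\Sigma)^{-1}(\bar x_1 - \bar x_2) \sim \chi^2_p.$$

Another example is the test whether $\mu_1 = 0$, i.e.,

$H_0: \mu_1 = 0$, versus $H_1$: no constraints,

for $N_{2p}\left(\binom{\mu_1}{\mu_2}, \begin{pmatrix} \Sigma & 0 \\ 0 & \Sigma \end{pmatrix}\right)$ with known $\Sigma$. This is equivalent to $A\mu = a$ with $A = (\mathcal{I}, 0)$, and $a = (0, \dots, 0)^\top \in \mathbb{R}^p$. Hence:

$$-2\log\lambda = n\bar x_1^\top \Sigma^{-1}\bar x_1 \sim \chi^2_p.$$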
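As a check on Example 7.8, the general statistic of Test Problem 5 with $A = (\mathcal{I}, -\mathcal{I})$ should collapse to the simplified form with $2\Sigma$. The sketch below assumes NumPy; the covariance matrix, the stacked mean vector and the sample size are made-up illustrative numbers, not data from the book.

```python
import numpy as np

def chi2_linear_known_sigma(xbar, Sigma, A, a, n):
    """n (A xbar - a)' (A Sigma A')^{-1} (A xbar - a) for known Sigma."""
    d = A @ xbar - a
    return n * (d @ np.linalg.solve(A @ Sigma @ A.T, d))

p, n = 2, 50
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])            # illustrative known covariance
zero = np.zeros((p, p))
Sigma_full = np.block([[Sigma, zero], [zero, Sigma]])  # block-diagonal 2p x 2p covariance
xbar = np.array([1.0, 0.2, 0.4, -0.1])                 # illustrative stacked mean (xbar_1, xbar_2)

A = np.hstack([np.eye(p), -np.eye(p)])                 # A = (I, -I)
a = np.zeros(p)

general = chi2_linear_known_sigma(xbar, Sigma_full, A, a, n)
x1, x2 = xbar[:p], xbar[p:]
special = n * ((x1 - x2) @ np.linalg.solve(2 * Sigma, x1 - x2))
print(general, special)
```

Both numbers coincide because $A\Sigma_{\text{full}}A^\top = 2\Sigma$ for the block-diagonal covariance.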


TEST PROBLEM 6 Suppose $X_1, \dots, X_n$ is an i.i.d. random sample from a $N_p(\mu, \Sigma)$ population.

$H_0: A\mu = a$, $\Sigma$ unknown versus $H_1$: no constraints.

From Corollary (5.4) and under $H_0$ it follows immediately that

$$(n-1)(A\bar x - a)^\top (ASA^\top)^{-1}(A\bar x - a) \sim T^2(q, n-1) \qquad (7.9)$$

since indeed under $H_0$,

$$A\bar x \sim N_q(a, n^{-1}A\Sigma A^\top)$$

is independent of

$$nASA^\top \sim W_q(A\Sigma A^\top, n-1).$$


EXAMPLE 7.9 Let's come back again to the bank data set and suppose that we want to test if $\mu_4 = \mu_5$, i.e., the hypothesis that the lower border mean equals the larger border mean for the forged bills. In this case:

$$A = (0\ 0\ 0\ 1\ -1\ 0), \quad a = 0.$$

The test statistic is:

$$99(A\bar x)^\top (AS_fA^\top)^{-1}(A\bar x) \sim T^2(1, 99) = F_{1,99}.$$

The observed value is 13.638 which is significant at the 5% level.
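The identity $T^2(1, 99) = F_{1,99}$ used in Example 7.9 is a special case of the general rescaling $T^2(q, n-1) = \frac{q(n-1)}{n-q} F_{q,n-q}$. A short sketch in plain Python (the helper name `t2_to_f` is an assumption made for illustration):

```python
def t2_to_f(t2, q, n):
    """Rescale a T^2(q, n-1) statistic to an F statistic with (q, n-q) degrees of freedom."""
    f_value = (n - q) / (q * (n - 1)) * t2
    return f_value, (q, n - q)

# Example 7.9: one restriction (q = 1) on n = 100 forged notes
f_value, dof = t2_to_f(13.638, q=1, n=100)
print(f_value, dof)
```

For $q = 1$ the scaling factor equals one, which is why the example can quote the $F_{1,99}$ distribution directly.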


Repeated Measurements 


In many situations, n independent sampling units are observed at p different times or un- 
der p different experimental conditions (different treatments,...). So here we repeat p one- 
dimensional measurements on n different subjects. For instance, we observe the results from 
n students taking p different exams. We end up with a (n x p) matrix. We can thus consider 


the situation where we have $X_1, \dots, X_n$ i.i.d. from a normal distribution $N_p(\mu, \Sigma)$ when there are $p$ repeated measurements. The hypothesis of interest in this case is that there are no treatment effects, $H_0: \mu_1 = \mu_2 = \dots = \mu_p$. This hypothesis is a direct application of Test Problem 6. Indeed, introducing an appropriate matrix transform on $\mu$ we have:

$$H_0: C\mu = 0 \quad \text{where} \quad C((p-1) \times p) = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 \\ 0 & 1 & -1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & -1 \end{pmatrix}. \qquad (7.10)$$


Note that in many cases one of the experimental conditions is the “control” (a placebo, standard drug or reference condition). Suppose it is the first component. In that case one is interested in studying differences to the control variable. The matrix $C$ has therefore a different form

$$C((p-1) \times p) = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 \\ 1 & 0 & -1 & \cdots & 0 \\ \vdots & \vdots & & \ddots & \vdots \\ 1 & 0 & 0 & \cdots & -1 \end{pmatrix}.$$

By (7.9) the null hypothesis will be rejected if:

$$\frac{n-p+1}{p-1}\, \bar x^\top C^\top (CSC^\top)^{-1} C\bar x > F_{1-\alpha;p-1,n-p+1}.$$

As a matter of fact, $C\mu$ is the mean of the random variable $y_i = Cx_i$:

$$y_i \sim N_{p-1}(C\mu, C\Sigma C^\top).$$


Simultaneous confidence intervals for linear combinations of the mean of $y_i$ have been derived above in (7.7). For all $a \in \mathbb{R}^{p-1}$, with probability $(1-\alpha)$ we have:

$$a^\top C\mu \in a^\top C\bar x \pm \sqrt{\frac{p-1}{n-p+1}\, F_{1-\alpha;p-1,n-p+1}\; a^\top CSC^\top a}.$$

Due to the nature of the problem here, the row sums of the elements in $C$ are zero: $C1_p = 0$, therefore $a^\top C$ is a vector whose sum of elements vanishes. This is called a contrast. Let $b = C^\top a$. We have $b^\top 1_p = \sum_{j=1}^{p} b_j = 0$. The result above thus provides for all contrasts of $\mu$, i.e. $b^\top\mu$, simultaneous confidence intervals at level $(1-\alpha)$:

$$b^\top\mu \in b^\top\bar x \pm \sqrt{\frac{p-1}{n-p+1}\, F_{1-\alpha;p-1,n-p+1}\; b^\top Sb}.$$

Examples of contrasts for $p = 4$ are $b^\top = (1\ -1\ 0\ 0)$ or $(1\ 0\ 0\ -1)$ or even $(1\ -\frac{1}{3}\ -\frac{1}{3}\ -\frac{1}{3})$ when the control is to be compared with the mean of 3 different treatments.
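The two forms of $C$ and the resulting contrasts can be built programmatically. The following sketch assumes NumPy; the function names are illustrative choices, not names used by the book.

```python
import numpy as np

def successive_difference_matrix(p):
    """(p-1) x p matrix with rows e_i - e_{i+1}, as in (7.10)."""
    C = np.zeros((p - 1, p))
    idx = np.arange(p - 1)
    C[idx, idx] = 1.0
    C[idx, idx + 1] = -1.0
    return C

def control_difference_matrix(p):
    """(p-1) x p matrix comparing the first (control) component with each other one."""
    C = np.zeros((p - 1, p))
    C[:, 0] = 1.0
    C[np.arange(p - 1), np.arange(1, p)] = -1.0
    return C

C_succ = successive_difference_matrix(4)
C_ctrl = control_difference_matrix(4)

# b = C'a is a contrast for any a; this a compares the control with the mean of 3 treatments.
a = np.array([1 / 3, 1 / 3, 1 / 3])
b = C_ctrl.T @ a
print(C_succ @ np.ones(4), b, b.sum())
```

Every row of either matrix sums to zero, so any linear combination $b = C^\top a$ of the rows sums to zero as well.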


EXAMPLE 7.10 Bock (1975) considers the evolution of the vocabulary of children from the eighth through eleventh grade. The data set contains the scores of a vocabulary test of 40 randomly chosen children that are observed from grades 8 to 11. This is a repeated measurement situation, $(n = 40, p = 4)$, since the same children were observed from grades 8 to 11. The statistics of interest are:

$$\bar x = (1.086, 2.544, 2.851, 3.420)^\top$$

$$S = \begin{pmatrix} 2.902 & 2.438 & 2.963 & 2.183 \\ 2.438 & 3.049 & 2.775 & 2.319 \\ 2.963 & 2.775 & 4.281 & 2.939 \\ 2.183 & 2.319 & 2.939 & 3.162 \end{pmatrix}.$$

Suppose we are interested in the yearly evolution of the children. Then the matrix $C$ providing successive differences of $\mu_j$ is:

$$C = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix}.$$

The value of the test statistic is $F_{obs} = 53.134$ which is highly significant for $F_{3,37}$. There are significant differences between the successive means. However, the analysis of the contrasts shows the following simultaneous 95% confidence intervals

$$-1.958 \le \mu_1 - \mu_2 \le -0.959$$
$$-0.949 \le \mu_2 - \mu_3 \le 0.335$$

Thus, the rejection of $H_0$ is mainly due to the difference between the children's performances in the first and second year. The confidence intervals for the following contrasts may also be of interest:

$$-2.283 \le \mu_1 - \tfrac{1}{3}(\mu_2 + \mu_3 + \mu_4) \le -1.423$$
$$-1.777 \le \tfrac{1}{3}(\mu_1 + \mu_2 + \mu_3) - \mu_4 \le -0.742$$

They show that $\mu_1$ is different from the average of the 3 other years (the same being true for $\mu_4$) and $\mu_4$ turns out to be higher than $\mu_2$ (and of course higher than $\mu_1$).
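The test statistic of Example 7.10 follows directly from the rejection rule based on (7.9). A sketch assuming NumPy, with the mean vector and covariance matrix transcribed from the example:

```python
import numpy as np

n, p = 40, 4
xbar = np.array([1.086, 2.544, 2.851, 3.420])
S = np.array([[2.902, 2.438, 2.963, 2.183],
              [2.438, 3.049, 2.775, 2.319],
              [2.963, 2.775, 4.281, 2.939],
              [2.183, 2.319, 2.939, 3.162]])

# Successive differences mu_1 - mu_2, mu_2 - mu_3, mu_3 - mu_4
C = np.array([[1.0, -1.0, 0.0, 0.0],
              [0.0, 1.0, -1.0, 0.0],
              [0.0, 0.0, 1.0, -1.0]])

d = C @ xbar
quad = d @ np.linalg.solve(C @ S @ C.T, d)      # xbar' C' (C S C')^{-1} C xbar
F_obs = (n - p + 1) / (p - 1) * quad            # compare with F_{p-1, n-p+1} = F_{3,37}
print(round(F_obs, 2))
```

With inputs rounded to three decimals the result matches the reported 53.134 only approximately.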


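Test Problem 7 illustrates how the likelihood ratio can be applied when testing a linear restriction on the coefficient $\beta$ of a linear model. It is also shown how a transformation of the test statistic leads to an exact $F$ test as presented in Chapter 3.

TEST PROBLEM 7 Suppose $Y_1, \dots, Y_n$ are independent with $Y_i \sim N_1(\beta^\top x_i, \sigma^2)$, and $x_i \in \mathbb{R}^p$.

$H_0: A\beta = a$, $\sigma^2$ unknown versus $H_1$: no constraints.

The constrained maximum likelihood estimators under $H_0$ are (Exercise 3.24):

$$\tilde\beta = \hat\beta - (\mathcal{X}^\top\mathcal{X})^{-1}A^\top\{A(\mathcal{X}^\top\mathcal{X})^{-1}A^\top\}^{-1}(A\hat\beta - a)$$

for $\beta$ and $\tilde\sigma^2 = \frac{1}{n}(y - \mathcal{X}\tilde\beta)^\top(y - \mathcal{X}\tilde\beta)$. The estimate $\hat\beta$ denotes the unconstrained MLE as before. Hence, the LR statistic is

$$-2\log\lambda = 2(\ell_1^* - \ell_0^*) = n\log\left(\frac{\|y - \mathcal{X}\tilde\beta\|^2}{\|y - \mathcal{X}\hat\beta\|^2}\right) \xrightarrow{\mathcal{L}} \chi^2_q$$

where $q$ is the number of elements of $a$. This problem also has an exact $F$-test since

$$\frac{n-p}{q}\left(\frac{\|y - \mathcal{X}\tilde\beta\|^2}{\|y - \mathcal{X}\hat\beta\|^2} - 1\right) = \frac{n-p}{q}\, \frac{(A\hat\beta - a)^\top\{A(\mathcal{X}^\top\mathcal{X})^{-1}A^\top\}^{-1}(A\hat\beta - a)}{(y - \mathcal{X}\hat\beta)^\top(y - \mathcal{X}\hat\beta)} \sim F_{q,n-p}.$$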

EXAMPLE 7.11 Let us continue with the “classic blue” pullovers. We can once more test if $\beta = 0$ in the regression of sales on prices. It holds that

$$\beta = 0 \quad \text{iff} \quad (0\ 1)\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = 0.$$

The LR statistic here is

$$-2\log\lambda = 0.284$$

which is not significant for the $\chi^2_1$ distribution. The F-test statistic

$$F = 0.231$$

is also not significant. Hence, we can assume independence of sales and prices (alone). Recall that this conclusion has to be revised if we consider the prices together with advertisement costs and hours of sales managers.

Recall the different conclusion that was made in Example 7.6 when we rejected $H_0: \alpha = 211$ and $\beta = 0$. The rejection there came from the fact that the pair of values was rejected. Indeed, if $\beta = 0$ the estimator of $\alpha$ would be $\bar y = 172.70$ and this is too far from 211.


EXAMPLE 7.12 Let us now consider the multivariate regression in the “classic blue” pullovers example. From Example 3.15 we know that the estimated parameters in the model

$$X_1 = \alpha + \beta_1 X_2 + \beta_2 X_3 + \beta_3 X_4 + \varepsilon$$

are

$$\hat\alpha = 65.670, \quad \hat\beta_1 = -0.216, \quad \hat\beta_2 = 0.485, \quad \hat\beta_3 = 0.844.$$

Hence, we could postulate the approximate relation:

$$\beta_1 \approx -\tfrac{1}{2}\beta_2,$$

which means in practice that augmenting the price by 20 EUR requires the advertisement costs to increase by 10 EUR in order to keep the number of pullovers sold constant. Vice versa, reducing the price by 20 EUR yields the same result as before if we reduced the advertisement costs by 10 EUR. Let us now test whether the hypothesis

$$H_0: \beta_1 = -\tfrac{1}{2}\beta_2$$

is valid. This is equivalent to

$$\begin{pmatrix} 0 & 1 & \tfrac{1}{2} & 0 \end{pmatrix}\begin{pmatrix} \alpha \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} = 0.$$

The LR statistic in this case is equal to (MVAlrtest.xpl)

$$-2\log\lambda = 0.012,$$

the $F$ statistic is

$$F = 0.007.$$

Hence, in both cases we will not reject the null hypothesis.


Comparison of Two Mean Vectors 


In many situations, we want to compare two groups of individuals for whom a set of $p$ characteristics has been observed. We have two random samples $\{x_{i1}\}_{i=1}^{n_1}$ and $\{x_{j2}\}_{j=1}^{n_2}$ from two distinct $p$-variate normal populations. Several testing issues can be addressed in this framework. In Test Problem 8 we will first test the hypothesis of equal mean vectors in the two groups under the assumption of equality of the two covariance matrices. This task can be solved by adapting Test Problem 2.


In Test Problem 9 a procedure for testing the equality of the two covariance matrices is 
presented. If the covariance matrices differ, the procedure of Test Problem 8 is no longer 
valid. If the equality of the covariance matrices is rejected, an easy rule for comparing two 
means with no restrictions on the covariance matrices is provided in Test Problem 10. 


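TEST PROBLEM 8 Assume that $X_{i1} \sim N_p(\mu_1, \Sigma)$, with $i = 1, \dots, n_1$ and $X_{j2} \sim N_p(\mu_2, \Sigma)$, with $j = 1, \dots, n_2$, where all the variables are independent.

$H_0: \mu_1 = \mu_2$, versus $H_1$: no constraints.

Both samples provide the statistics $\bar x_k$ and $S_k$, $k = 1, 2$. Let $\delta = \mu_1 - \mu_2$. We have

$$(\bar x_1 - \bar x_2) \sim N_p\left(\delta, \frac{n_1 + n_2}{n_1 n_2}\Sigma\right) \qquad (7.11)$$

$$n_1 S_1 + n_2 S_2 \sim W_p(\Sigma, n_1 + n_2 - 2). \qquad (7.12)$$

Let $S = (n_1 + n_2)^{-1}(n_1 S_1 + n_2 S_2)$ be the weighted mean of $S_1$ and $S_2$. Since the two samples are independent and since $S_k$ is independent of $\bar x_k$ (for $k = 1, 2$) it follows that $S$ is independent of $(\bar x_1 - \bar x_2)$. Hence, Theorem 5.8 applies and leads to a $T^2$-distribution:

$$\frac{n_1 n_2 (n_1 + n_2 - 2)}{(n_1 + n_2)^2}\{(\bar x_1 - \bar x_2) - \delta\}^\top S^{-1}\{(\bar x_1 - \bar x_2) - \delta\} \sim T^2(p, n_1 + n_2 - 2) \qquad (7.13)$$

or

$$\{(\bar x_1 - \bar x_2) - \delta\}^\top S^{-1}\{(\bar x_1 - \bar x_2) - \delta\} \sim \frac{p(n_1 + n_2)^2}{(n_1 + n_2 - p - 1)\, n_1 n_2}\, F_{p, n_1 + n_2 - p - 1}.$$

This result, as in Test Problem 2, can be used to test $H_0: \delta = 0$ or to construct a confidence region for $\delta \in \mathbb{R}^p$. The rejection region is given by:

$$\frac{n_1 n_2 (n_1 + n_2 - p - 1)}{p(n_1 + n_2)^2}(\bar x_1 - \bar x_2)^\top S^{-1}(\bar x_1 - \bar x_2) \ge F_{1-\alpha; p, n_1 + n_2 - p - 1}. \qquad (7.14)$$

A $(1 - \alpha)$ confidence region for $\delta$ is given by the ellipsoid centered at $(\bar x_1 - \bar x_2)$

$$\{\delta - (\bar x_1 - \bar x_2)\}^\top S^{-1}\{\delta - (\bar x_1 - \bar x_2)\} \le \frac{p(n_1 + n_2)^2}{(n_1 + n_2 - p - 1)(n_1 n_2)}\, F_{1-\alpha; p, n_1 + n_2 - p - 1},$$

and the simultaneous confidence intervals for all linear combinations $a^\top\delta$ of the elements of $\delta$ are given by

$$a^\top\delta \in a^\top(\bar x_1 - \bar x_2) \pm \sqrt{\frac{p(n_1 + n_2)^2}{(n_1 + n_2 - p - 1)(n_1 n_2)}\, F_{1-\alpha; p, n_1 + n_2 - p - 1}\; a^\top S a}.$$

In particular we have at the $(1 - \alpha)$ level, for $j = 1, \dots, p$,

$$\delta_j \in (\bar x_{1j} - \bar x_{2j}) \pm \sqrt{\frac{p(n_1 + n_2)^2}{(n_1 + n_2 - p - 1)(n_1 n_2)}\, F_{1-\alpha; p, n_1 + n_2 - p - 1}\; s_{jj}}. \qquad (7.15)$$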


EXAMPLE 7.13 Let us come back to the questions raised in Example 7.5. We compare the means of assets ($X_1$) and of sales ($X_2$) for two sectors, energy (group 1) and manufacturing (group 2). With $n_1 = 15$, $n_2 = 10$, and $p = 2$ we obtain the statistics:

$$\bar x_1 = \begin{pmatrix} 4084 \\ 2580.5 \end{pmatrix}, \quad \bar x_2 = \begin{pmatrix} 4307.2 \\ 4925.2 \end{pmatrix}$$

and

$$S_1 = 10^7 \begin{pmatrix} 1.6635 & 1.2410 \\ 1.2410 & 1.3747 \end{pmatrix}, \quad S_2 = 10^7 \begin{pmatrix} 1.2248 & 1.1425 \\ 1.1425 & 1.5112 \end{pmatrix},$$

so that

$$S = 10^7 \begin{pmatrix} 1.4880 & 1.2016 \\ 1.2016 & 1.4293 \end{pmatrix}.$$

The observed value of the test statistic (7.14) is $F = 2.7036$. Since $F_{0.95;2,22} = 3.4434$ the hypothesis of equal means of the two groups is not rejected although it would be rejected at a less severe level ($F > F_{0.90;2,22} = 2.5613$). The 95% simultaneous confidence intervals for the differences (MVAsimcidif.xpl) are given by

$$-4628.6 \le \mu_{1a} - \mu_{2a} \le 4182.2$$
$$-6662.4 \le \mu_{1s} - \mu_{2s} \le 1973.0.$$

EXAMPLE 7.14 In order to illustrate the presented test procedures it is interesting to analyze some simulated data. This simulation will point out the importance of the covariances in testing means. We created 2 independent normal samples in $\mathbb{R}^4$ of sizes $n_1 = 30$ and $n_2 = 20$ with:

$$\mu_1 = (8, 6, 10, 10)^\top$$
$$\mu_2 = (6, 6, 10, 13)^\top.$$

One may consider this as an example of $X = (X_1, \dots, X_n)^\top$ being the students' scores from 4 tests, where the 2 groups of students were subjected to two different methods of teaching. First we simulate the two samples with $\Sigma = \mathcal{I}_4$ and obtain the statistics:

$$\bar x_1 = (7.607, 5.945, 10.213, 9.635)^\top$$
$$\bar x_2 = (6.222, 6.444, 9.560, 13.041)^\top$$

$$S_1 = \begin{pmatrix} 0.812 & -0.229 & -0.034 & 0.073 \\ -0.229 & 1.001 & 0.010 & -0.059 \\ -0.034 & 0.010 & 1.078 & -0.098 \\ 0.073 & -0.059 & -0.098 & 0.823 \end{pmatrix}$$

$$S_2 = \begin{pmatrix} 0.559 & -0.057 & -0.271 & 0.306 \\ -0.057 & 1.237 & 0.181 & 0.021 \\ -0.271 & 0.181 & 1.159 & -0.130 \\ 0.306 & 0.021 & -0.130 & 0.683 \end{pmatrix}.$$

The test statistic (7.14) takes the value $F = 60.65$ which is highly significant: the small variance allows the difference to be detected even with these relatively moderate sample sizes. We conclude (at the 95% level) that:

$$0.6213 \le \delta_1 \le 2.2691$$
$$-1.5217 \le \delta_2 \le 0.5241$$
$$-0.3766 \le \delta_3 \le 1.6830$$
$$-4.2614 \le \delta_4 \le -2.5494$$

which confirms that the means for $X_1$ and $X_4$ are different.
Consider now a different simulation scenario where the standard deviations are 4 times larger: $\Sigma = 16\mathcal{I}_4$. Here we obtain:

$$\bar x_1 = (7.312, 6.304, 10.840, 10.902)^\top$$
$$\bar x_2 = (6.353, 5.890, 8.604, 11.283)^\top$$

$$S_1 = \begin{pmatrix} 21.907 & 1.415 & -2.050 & 2.379 \\ 1.415 & 11.853 & 2.104 & -1.864 \\ -2.050 & 2.104 & 17.230 & 0.905 \\ 2.379 & -1.864 & 0.905 & 9.037 \end{pmatrix}$$

$$S_2 = \begin{pmatrix} 20.349 & -9.463 & 0.958 & -6.507 \\ -9.463 & 15.502 & -3.383 & -2.551 \\ 0.958 & -3.383 & 14.470 & -0.323 \\ -6.507 & -2.551 & -0.323 & 10.311 \end{pmatrix}.$$

Now the test statistic takes the value 1.54 which is no longer significant ($F_{0.95;4,45} = 2.58$). Now we cannot reject the null hypothesis (which we know to be false!) since the increase in variances prohibits the detection of differences of such magnitude.
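The effect of the larger variances can be caricatured at the population level: the statistic (7.14) is proportional to $d^\top S^{-1} d$, so multiplying the covariance matrix by 16 divides the statistic by 16. The sketch below assumes NumPy and plugs the true means and covariances in place of sample statistics; it is not a re-run of the simulation, and `two_sample_f` is a hypothetical helper name.

```python
import numpy as np

def two_sample_f(d, S, n1, n2):
    """F statistic (7.14) for a mean difference d and pooled covariance S."""
    p = len(d)
    scale = n1 * n2 * (n1 + n2 - p - 1) / (p * (n1 + n2) ** 2)
    return scale * (d @ np.linalg.solve(S, d))

n1, n2 = 30, 20
delta = np.array([8.0, 6.0, 10.0, 10.0]) - np.array([6.0, 6.0, 10.0, 13.0])  # mu_1 - mu_2

F_unit = two_sample_f(delta, np.eye(4), n1, n2)          # Sigma = I_4
F_scaled = two_sample_f(delta, 16 * np.eye(4), n1, n2)   # Sigma = 16 I_4
F_crit = 2.58                                            # F_{0.95;4,45} as quoted in the text
print(F_unit, F_scaled, F_scaled < F_crit < F_unit)
```

Under these idealized inputs the same mean difference lies far above the critical value for unit variances and below it once the variances are 16 times larger, which mirrors the behaviour seen in the simulated samples.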


The following situation illustrates once more the role of the covariances between covariates. Suppose that $\Sigma = 16\mathcal{I}_4$ as above but with $\sigma_{14} = \sigma_{41} = -3.999$ (this corresponds to a negative correlation $r_{1,4} = -0.9997$). We have:

$$\bar x_1 = (8.484, 5.908, 9.024, 10.459)^\top$$
$$\bar x_2 = (4.959, 7.307, 9.057, 13.803)^\top$$

$$S_1 = \begin{pmatrix} 14.649 & -0.024 & 1.248 & -3.961 \\ -0.024 & 15.825 & 0.746 & 4.301 \\ 1.248 & 0.746 & 9.446 & 1.241 \\ -3.961 & 4.301 & 1.241 & 20.002 \end{pmatrix}$$

$$S_2 = \begin{pmatrix} 14.035 & -2.372 & 5.596 & -1.601 \\ -2.372 & 9.173 & -2.027 & -2.954 \\ 5.596 & -2.027 & 9.021 & -1.301 \\ -1.601 & -2.954 & -1.301 & 9.593 \end{pmatrix}.$$

The value of $F$ is 3.853 which is significant at the 5% level (p-value = 0.0089). So the null hypothesis $\delta = \mu_1 - \mu_2 = 0$ is outside the 95% confidence ellipsoid. However, the simultaneous confidence intervals, which do not take the covariances into account, are given by:

$$-0.1837 \le \delta_1 \le 7.2343$$
$$-4.9452 \le \delta_2 \le 2.1466$$
$$-3.0091 \le \delta_3 \le 2.9438$$
$$-7.2336 \le \delta_4 \le 0.5450.$$

They contain the null value (see Remark 7.1 above) although they are very asymmetric for $\delta_1$ and $\delta_4$.
EXAMPLE 7.15 Let us compare the vectors of means of the forged and the genuine bank notes. The matrices $S_f$ and $S_g$ were given in Example 3.1 and since here $n_f = n_g = 100$, $S$ is the simple average of $S_f$ and $S_g$: $S = \frac{1}{2}(S_f + S_g)$.

$$\bar x_g = (214.97, 129.94, 129.72, 8.305, 10.168, 141.52)^\top$$
$$\bar x_f = (214.82, 130.3, 130.19, 10.53, 11.133, 139.45)^\top.$$

The test statistic is given by (7.14) and turns out to be $F = 391.92$ which is highly significant for $F_{6,193}$. The 95% simultaneous confidence intervals for the differences $\delta_j = \mu_{gj} - \mu_{fj}$, $j = 1, \dots, p$ are:

$$-0.0443 \le \delta_1 \le 0.3363$$
$$-0.5186 \le \delta_2 \le -0.1954$$
$$-0.6416 \le \delta_3 \le -0.3044$$
$$-2.6981 \le \delta_4 \le -1.7519$$
$$-1.2952 \le \delta_5 \le -0.6348$$
$$1.8072 \le \delta_6 \le 2.3268.$$

All of the components (except for the first one) show significant differences in the means. The main effects are taken by the lower border ($X_4$) and the diagonal ($X_6$).


The preceding test implicitly uses the fact that the two samples are extracted from two different populations with common variance $\Sigma$. In this case, the test statistic (7.14) measures the distance between the two centers of gravity of the two groups w.r.t. the common metric given by the pooled variance matrix $S$. If $\Sigma_1 \neq \Sigma_2$ no such matrix exists. There are no satisfactory test procedures for testing the equality of variance matrices which are robust with respect to normality assumptions of the populations. The following test extends Bartlett's test for equality of variances in the univariate case. But this test is known to be very sensitive to departures from normality.


TEST PROBLEM 9 (Comparison of Covariance Matrices)
Let $X_{ih} \sim N_p(\mu_h, \Sigma_h)$, $i = 1, \dots, n_h$, $h = 1, \dots, k$ be independent random variables,

$H_0: \Sigma_1 = \Sigma_2 = \dots = \Sigma_k$ versus $H_1$: no constraints.

Each subsample provides $S_h$, an estimator of $\Sigma_h$, with

$$n_h S_h \sim W_p(\Sigma_h, n_h - 1).$$

Under $H_0$, $\sum_{h=1}^{k} n_h S_h \sim W_p(\Sigma, n - k)$ (Section 5.2), where $\Sigma$ is the common covariance matrix and $n = \sum_{h=1}^{k} n_h$. Let $S = \frac{n_1 S_1 + \dots + n_k S_k}{n}$ be the weighted average of the $S_h$ (this is in fact the MLE of $\Sigma$ when $H_0$ is true). The likelihood ratio test leads to the statistic

$$-2\log\lambda = n\log|S| - \sum_{h=1}^{k} n_h \log|S_h| \qquad (7.16)$$

which under $H_0$ is approximately distributed as a $\chi^2_m$ where $m = \frac{1}{2}(k-1)p(p+1)$.


EXAMPLE 7.16 Let's come back to Example 7.13, where the mean of assets and sales have been compared for companies from the energy and manufacturing sector assuming that $\Sigma_1 = \Sigma_2$. The test of $\Sigma_1 = \Sigma_2$ leads to the value of the test statistic

$$-2\log\lambda = 0.9076 \qquad (7.17)$$

which is not significant (p-value for a $\chi^2_3$ = 0.82). We cannot reject $H_0$ and the comparison of the means performed above is valid.
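The statistic (7.16) can be evaluated for the two sector matrices used in Example 7.16. A minimal sketch assuming NumPy, with `lr_equal_covariances` as an illustrative name:

```python
import numpy as np

def lr_equal_covariances(S_list, n_list):
    """Return (-2 log lambda, degrees of freedom) for H0: Sigma_1 = ... = Sigma_k, see (7.16)."""
    n = sum(n_list)
    k = len(S_list)
    p = S_list[0].shape[0]
    S_pooled = sum(nh * Sh for nh, Sh in zip(n_list, S_list)) / n
    stat = n * np.linalg.slogdet(S_pooled)[1]
    stat -= sum(nh * np.linalg.slogdet(Sh)[1] for nh, Sh in zip(n_list, S_list))
    m = (k - 1) * p * (p + 1) // 2
    return stat, m

# Energy (n1 = 15) and manufacturing (n2 = 10) covariance matrices from Example 7.13
S1 = 1e7 * np.array([[1.6635, 1.2410], [1.2410, 1.3747]])
S2 = 1e7 * np.array([[1.2248, 1.1425], [1.1425, 1.5112]])

stat, m = lr_equal_covariances([S1, S2], [15, 10])
print(f"-2 log lambda = {stat:.4f} on {m} degrees of freedom")
```

Using `slogdet` avoids overflow or underflow in the determinants; the result agrees with (7.17) up to rounding of the inputs.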


EXAMPLE 7.17 Let us compare the covariance matrices of the forged and the genuine bank notes (the matrices $S_f$ and $S_g$ are shown in Example 3.1). A first look seems to suggest that $\Sigma_f \neq \Sigma_g$. The pooled variance $S$ is given by $S = \frac{1}{2}(S_f + S_g)$ since here $n_f = n_g$. The test statistic here is $-2\log\lambda = 127.21$, which is highly significant for a $\chi^2$ with 21 degrees of freedom. As expected, we reject the hypothesis of equal covariance matrices, and as a result the procedure for comparing the two means in Example 7.15 is not valid.


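What can we do with unequal covariance matrices? When both $n_1$ and $n_2$ are large, we have a simple solution:

TEST PROBLEM 10 (Comparison of two means, unequal covariance matrices, large samples)

Assume that $X_{i1} \sim N_p(\mu_1, \Sigma_1)$, with $i = 1, \dots, n_1$ and $X_{j2} \sim N_p(\mu_2, \Sigma_2)$, with $j = 1, \dots, n_2$ are independent random variables.

$H_0: \mu_1 = \mu_2$ versus $H_1$: no constraints.

Letting $\delta = \mu_1 - \mu_2$, we have

$$(\bar x_1 - \bar x_2) \sim N_p\left(\delta, \frac{\Sigma_1}{n_1} + \frac{\Sigma_2}{n_2}\right).$$

Therefore, by (5.4), under $H_0$

$$(\bar x_1 - \bar x_2)^\top\left(\frac{\Sigma_1}{n_1} + \frac{\Sigma_2}{n_2}\right)^{-1}(\bar x_1 - \bar x_2) \sim \chi^2_p.$$

Since $S_i$ is a consistent estimator of $\Sigma_i$ for $i = 1, 2$, we have

$$(\bar x_1 - \bar x_2)^\top\left(\frac{S_1}{n_1} + \frac{S_2}{n_2}\right)^{-1}(\bar x_1 - \bar x_2) \xrightarrow{\mathcal{L}} \chi^2_p. \qquad (7.18)$$

This can be used in place of (7.13) for testing $H_0$, defining a confidence region for $\delta$ or constructing simultaneous confidence intervals for $\delta_j$, $j = 1, \dots, p$.

For instance, the rejection region at the level $\alpha$ will be

$$(\bar x_1 - \bar x_2)^\top\left(\frac{S_1}{n_1} + \frac{S_2}{n_2}\right)^{-1}(\bar x_1 - \bar x_2) > \chi^2_{1-\alpha;p} \qquad (7.19)$$

and the $(1 - \alpha)$ simultaneous confidence intervals for $\delta_j$, $j = 1, \dots, p$ are:

$$\delta_j \in (\bar x_{1j} - \bar x_{2j}) \pm \sqrt{\chi^2_{1-\alpha;p}\left(\frac{s^{(1)}_{jj}}{n_1} + \frac{s^{(2)}_{jj}}{n_2}\right)} \qquad (7.20)$$

where $s^{(i)}_{jj}$ is the $(j, j)$ element of the matrix $S_i$. This may be compared to (7.15) where the pooled variance was used.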


REMARK 7.2 We see, by comparing the statistics (7.19) with (7.14), that we measure here the distance between $\bar x_1$ and $\bar x_2$ using the metric $\left(\frac{S_1}{n_1} + \frac{S_2}{n_2}\right)$. It should be noticed that when $n_1 = n_2$, the two methods are essentially the same since then $S = \frac{1}{2}(S_1 + S_2)$. If the covariances are different but have the same eigenvectors (different eigenvalues), one can apply the common principal component (CPC) technique, see Chapter 9.


EXAMPLE 7.18 Let us use the last test to compare the forged and the genuine bank notes again ($n_1$ and $n_2$ are both large). The test statistic (7.19) turns out to be 2436.8 which is again highly significant. The 95% simultaneous confidence intervals are:

$$-0.0389 \le \delta_1 \le 0.3309$$
$$-0.5140 \le \delta_2 \le -0.2000$$
$$-0.6368 \le \delta_3 \le -0.3092$$
$$-2.6846 \le \delta_4 \le -1.7654$$
$$-1.2858 \le \delta_5 \le -0.6442$$
$$1.8146 \le \delta_6 \le 2.3194$$

showing that all the components except the first are different from zero, the larger difference coming from $X_6$ (length of the diagonal) and $X_4$ (lower border). The results are very similar to those obtained in Example 7.15. This is due to the fact that here $n_1 = n_2$ as we already mentioned in the remark above.
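The closeness of Examples 7.15 and 7.18 can be made explicit: with equal group sizes, the statistic (7.19) is a fixed multiple of the quadratic form behind (7.14). A sketch in plain Python, where `chi2_from_pooled_f` is a hypothetical helper:

```python
def chi2_from_pooled_f(f_stat, n, p):
    """For n1 = n2 = n, map the F statistic (7.14) to the large-sample statistic (7.19)."""
    n1 = n2 = n
    # Undo the scaling in (7.14) to recover d' S^{-1} d with S = (S1 + S2) / 2
    quad = f_stat * p * (n1 + n2) ** 2 / (n1 * n2 * (n1 + n2 - p - 1))
    # S1/n + S2/n = (2/n) S, hence d' (S1/n + S2/n)^{-1} d = (n/2) d' S^{-1} d
    return n / 2 * quad

# Example 7.15 reports F = 391.92 for n = 100 notes per group and p = 6 variables
chi2_stat = chi2_from_pooled_f(391.92, n=100, p=6)
print(round(chi2_stat, 1))
```

The value obtained this way agrees with the 2436.8 reported in Example 7.18 up to rounding of the F statistic.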
Profile Analysis


Another useful application of Test Problem 6 is the repeated measurements problem applied 
to two independent groups. This problem arises in practice when we observe repeated 
measurements of characteristics (or measures of the same type under different experimental 
conditions) on the different groups which have to be compared. It is important that the p 
measures (the “profile” ) are comparable and in particular are reported in the same units. 
For instance, they may be measures of blood pressure at p different points in time, one group 
being the control group and the other the group receiving a new treatment. The observations 
may be the scores obtained from p different tests of two different experimental groups. One 
is then interested in comparing the profiles of each group: the profile being just the vectors 
of the means of the p responses (the comparison may be visualized in a two dimensional 
graph using the parallel coordinate plot introduced in Section 1.7). 


We are thus in the same statistical situation as for the comparison of two means:

$$X_{i1} \sim N_p(\mu_1, \Sigma), \quad i = 1, \dots, n_1$$
$$X_{i2} \sim N_p(\mu_2, \Sigma), \quad i = 1, \dots, n_2$$

where all variables are independent. Suppose the two population profiles look like Figure 7.1.


The following questions are of interest: 


1. Are the profiles similar in the sense of being parallel (which means no interaction 
between the treatments and the groups)? 


2. If the profiles are parallel, are they at the same level? 


3. If the profiles are parallel, is there any treatment effect, i.e., are the profiles horizontal? 


The above questions are easily translated into linear constraints on the means and a test 
statistic can be obtained accordingly. 


Parallel Profiles 


Let $C$ be a $(p-1) \times p$ matrix defined as

$$C = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 \\ 0 & 1 & -1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & -1 \end{pmatrix}.$$

The hypothesis to be tested is

$$H_0^{(1)}: C(\mu_1 - \mu_2) = 0.$$

From (7.11), (7.12) and Corollary 5.4 we know that under $H_0$:

$$\frac{n_1 n_2}{(n_1 + n_2)^2}(n_1 + n_2 - 2)\{C(\bar x_1 - \bar x_2)\}^\top (CSC^\top)^{-1} C(\bar x_1 - \bar x_2) \sim T^2(p-1, n_1 + n_2 - 2) \qquad (7.21)$$

where $S$ is the pooled covariance matrix. The hypothesis is rejected if

$$\frac{n_1 n_2 (n_1 + n_2 - p)}{(n_1 + n_2)^2 (p-1)}\{C(\bar x_1 - \bar x_2)\}^\top (CSC^\top)^{-1} C(\bar x_1 - \bar x_2) > F_{1-\alpha; p-1, n_1 + n_2 - p}.$$

Figure 7.1. Example of population profiles (mean against treatment for Group 1 and Group 2). MVAprofil.xpl


Equality of Two Levels 


The question of equality of the two levels is meaningful only if the two profiles are parallel. In the case of interactions (rejection of $H_0^{(1)}$), the two populations react differently to the treatments and the question of the level has no meaning.

The equality of the two levels can be formalized as

$$H_0^{(2)}: 1_p^\top(\mu_1 - \mu_2) = 0$$

since

$$1_p^\top(\bar x_1 - \bar x_2) \sim N_1\left(1_p^\top(\mu_1 - \mu_2), \frac{n_1 + n_2}{n_1 n_2} 1_p^\top \Sigma 1_p\right)$$

and

$$(n_1 + n_2)\, 1_p^\top S 1_p \sim W_1(1_p^\top \Sigma 1_p, n_1 + n_2 - 2).$$

Using Corollary 5.4 we have that:

$$\frac{n_1 n_2}{(n_1 + n_2)^2}(n_1 + n_2 - 2)\frac{\{1_p^\top(\bar x_1 - \bar x_2)\}^2}{1_p^\top S 1_p} \sim T^2(1, n_1 + n_2 - 2) = F_{1, n_1 + n_2 - 2}. \qquad (7.22)$$

The rejection region is

$$\frac{n_1 n_2 (n_1 + n_2 - 2)}{(n_1 + n_2)^2}\frac{\{1_p^\top(\bar x_1 - \bar x_2)\}^2}{1_p^\top S 1_p} > F_{1-\alpha; 1, n_1 + n_2 - 2}.$$


Treatment Effect 


If it is rejected that the profiles are parallel, then two independent analyses should be done on the two groups using the repeated measurement approach. But if it is accepted that they are parallel, then we can exploit the information contained in both groups (possibly at different levels) to test a treatment effect, i.e., if the two profiles are horizontal. This may be written as:

$$H_0^{(3)}: C(\mu_1 + \mu_2) = 0.$$

Consider the average profile $\bar x$:

$$\bar x = \frac{n_1\bar x_1 + n_2\bar x_2}{n_1 + n_2}.$$

Clearly,

$$\bar x \sim N_p\left(\frac{n_1\mu_1 + n_2\mu_2}{n_1 + n_2}, \frac{1}{n_1 + n_2}\Sigma\right).$$

Now it is not hard to prove that $H_0^{(3)}$ with $H_0^{(1)}$ implies that

$$C\left(\frac{n_1\mu_1 + n_2\mu_2}{n_1 + n_2}\right) = 0.$$

So under parallel, horizontal profiles we have

$$\sqrt{n_1 + n_2}\, C\bar x \sim N_{p-1}(0, C\Sigma C^\top).$$

From Corollary 5.4 we again obtain

$$(n_1 + n_2 - 2)(C\bar x)^\top (CSC^\top)^{-1} C\bar x \sim T^2(p-1, n_1 + n_2 - 2).$$

This leads to the rejection region of $H_0^{(3)}$, namely

$$\frac{n_1 + n_2 - p}{p - 1}(C\bar x)^\top (CSC^\top)^{-1} C\bar x > F_{1-\alpha; p-1, n_1 + n_2 - p}. \qquad (7.23)$$
EXAMPLE 7.19 Morrison (1990) proposed a test in which the results of 4 sub-tests of the Wechsler Adult Intelligence Scale (WAIS) are compared for 2 categories of people: group 1 contains $n_1 = 37$ people who do not have a senile factor and group 2 contains $n_2 = 12$ people who have a senile factor. The four WAIS sub-tests are $X_1$ (information), $X_2$ (similarities), $X_3$ (arithmetic) and $X_4$ (picture completion). The relevant statistics are

$$\bar x_1 = (12.57, 9.57, 11.49, 7.97)^\top$$
$$\bar x_2 = (8.75, 5.33, 8.50, 4.75)^\top$$

$$S_1 = \begin{pmatrix} 11.164 & 8.840 & 6.210 & 2.020 \\ 8.840 & 11.759 & 5.778 & 0.529 \\ 6.210 & 5.778 & 10.790 & 1.743 \\ 2.020 & 0.529 & 1.743 & 3.594 \end{pmatrix}$$

$$S_2 = \begin{pmatrix} 9.688 & 9.583 & 8.875 & 7.021 \\ 9.583 & 16.722 & 11.083 & 8.167 \\ 8.875 & 11.083 & 12.083 & 4.875 \\ 7.021 & 8.167 & 4.875 & 11.688 \end{pmatrix}.$$

The test statistic for testing if the two profiles are parallel is $F = 0.4634$, which is not significant (p-value = 0.71). Thus it is accepted that the two are parallel. The second test statistic (testing the equality of the levels of the 2 profiles) is $F = 17.21$, which is highly significant (p-value $\approx 10^{-4}$). The global level of the test for the non-senile people is superior to the senile group. The final test (testing the horizontality of the average profile) has the test statistic $F = 53.32$, which is also highly significant (p-value $\approx 10^{-4}$). This implies that there are substantial differences among the means of the different subtests.
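The three profile tests of Example 7.19 can be retraced from the summary statistics. The sketch below assumes NumPy and transcribes the mean vectors and covariance matrices from the example; the variable names are illustrative.

```python
import numpy as np

n1, n2, p = 37, 12, 4
n = n1 + n2
xbar1 = np.array([12.57, 9.57, 11.49, 7.97])
xbar2 = np.array([8.75, 5.33, 8.50, 4.75])
S1 = np.array([[11.164, 8.840, 6.210, 2.020],
               [8.840, 11.759, 5.778, 0.529],
               [6.210, 5.778, 10.790, 1.743],
               [2.020, 0.529, 1.743, 3.594]])
S2 = np.array([[9.688, 9.583, 8.875, 7.021],
               [9.583, 16.722, 11.083, 8.167],
               [8.875, 11.083, 12.083, 4.875],
               [7.021, 8.167, 4.875, 11.688]])

S = (n1 * S1 + n2 * S2) / n                      # pooled covariance matrix
C = np.array([[1.0, -1.0, 0.0, 0.0],
              [0.0, 1.0, -1.0, 0.0],
              [0.0, 0.0, 1.0, -1.0]])
ones = np.ones(p)
d = xbar1 - xbar2
CSC = C @ S @ C.T

# (1) Parallel profiles: rejection rule following (7.21), compare with F_{p-1, n1+n2-p}
Cd = C @ d
F_parallel = n1 * n2 * (n - p) / (n ** 2 * (p - 1)) * (Cd @ np.linalg.solve(CSC, Cd))

# (2) Equal levels: (7.22), compare with F_{1, n1+n2-2}
F_level = n1 * n2 * (n - 2) / n ** 2 * (ones @ d) ** 2 / (ones @ S @ ones)

# (3) Horizontal average profile: (7.23), compare with F_{p-1, n1+n2-p}
xbar = (n1 * xbar1 + n2 * xbar2) / n
Cx = C @ xbar
F_flat = (n - p) / (p - 1) * (Cx @ np.linalg.solve(CSC, Cx))

print(round(F_parallel, 3), round(F_level, 2), round(F_flat, 2))
```

Because the published statistics are rounded, the three values reproduce 0.4634, 17.21 and 53.32 only approximately.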


Summary

↪ Hypotheses about $\mu$ can often be written as $A\mu = a$, with matrix $A$, and vector $a$.

↪ The hypothesis $H_0: A\mu = a$ for $X \sim N_p(\mu, \Sigma)$ with $\Sigma$ known leads to $-2\log\lambda = n(A\bar x - a)^\top (A\Sigma A^\top)^{-1}(A\bar x - a) \sim \chi^2_q$, where $q$ is the number of elements in $a$.

↪ The hypothesis $H_0: A\mu = a$ for $X \sim N_p(\mu, \Sigma)$ with $\Sigma$ unknown leads to $-2\log\lambda = n\log\{1 + (A\bar x - a)^\top (ASA^\top)^{-1}(A\bar x - a)\} \to \chi^2_q$, where $q$ is the number of elements in $a$ and we have an exact test $(n-1)(A\bar x - a)^\top (ASA^\top)^{-1}(A\bar x - a) \sim T^2(q, n-1)$.

↪ The hypothesis $H_0: A\beta = a$ for $Y_i \sim N_1(\beta^\top x_i, \sigma^2)$ with $\sigma^2$ unknown leads to $-2\log\lambda = n\log\left(\frac{\|y - \mathcal{X}\tilde\beta\|^2}{\|y - \mathcal{X}\hat\beta\|^2}\right) \to \chi^2_q$, with $q$ being the length of $a$ and with

$$\frac{n-p}{q}\, \frac{(A\hat\beta - a)^\top\{A(\mathcal{X}^\top\mathcal{X})^{-1}A^\top\}^{-1}(A\hat\beta - a)}{(y - \mathcal{X}\hat\beta)^\top(y - \mathcal{X}\hat\beta)} \sim F_{q,n-p}.$$

7.3. Boston Housing 


Returning to the Boston housing data set, we are now in a position to test if the means of the variables vary according to their location, for example, when they are located in a district with high valued houses. In Chapter 1, we built 2 groups of observations according to the value of $X_{14}$ being less than or equal to the median of $X_{14}$ (a group of 256 districts) and greater than the median (a group of 250 districts). In what follows, we use the transformed variables motivated in Section 1.8.

Testing the equality of the means from the two groups was proposed in a multivariate setup, so we restrict the analysis to the variables $X_1$, $X_5$, $X_8$, $X_{11}$, and $X_{13}$ to see if the differences between the two groups that were identified in Chapter 1 can be confirmed by a formal test. As in Test Problem 8, the hypothesis to be tested is

$$H_0: \mu_1 = \mu_2, \quad \text{where } \mu_1 \in \mathbb{R}^5,\ n_1 = 256, \text{ and } n_2 = 250.$$

$\Sigma$ is not known. The F-statistic given in (7.13) is equal to 126.30, which is much higher than the critical value $F_{0.95;5,500} = 2.23$. Therefore, we reject the hypothesis of equal means.

To see which component, $X_1$, $X_5$, $X_8$, $X_{11}$, or $X_{13}$, is responsible for this rejection, take a look at the simultaneous confidence intervals defined in (7.14):

$$\delta_1 \in (1.4020, 2.5499)$$
$$\delta_5 \in (0.1315, 0.2383)$$
$$\delta_8 \in (-0.5344, -0.2222)$$
$$\delta_{11} \in (1.0375, 1.7384)$$
$$\delta_{13} \in (1.1577, 1.5818).$$

These confidence intervals confirm that all of the $\delta_j$ are significantly different from zero (note there is a negative effect for $X_8$: weighted distances to employment centers). MVAsimcibh.xpl
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We could also check if the factor “being bounded by the river” (variable X4) has some effect 
on the other variables. To do this compare the means of (X5, Xg, X9, X12, X13, X14) '. There 
are two groups: n; = 35 districts bounded by the river and ng = 471 districts not bounded 
by the river. Test Problem 8 (Ho : 41 = 42) is applied again with p = 6. The resulting 
test statistic, F’ = 5.81, is highly significant (Fo .95.6,499 = 2.12). The simultaneous confidence 
intervals indicate that only X14 (the value of the houses) is responsible for the hypothesis 


being rejected! At a significance level of 0.95 


05 
Og 


Testing Linear Restrictions 


In Chapter 3 a linear model was proposed that explained the variations of the price X14 by 
the variations of the other variables. Using the same procedure that was shown in Testing 
Problem 7, we are in a position to test a set of linear restrictions on the vector of regression 


coefficients (. 


The model we estimated in Section 3.7 provides the following (Q MVAlinregbh. xp1): 


Nnnnmnmn mn 


LN LO ON OS OS ON 


—0.0603, 0.1919 
—0.5225, 0.1527 
—0.5051, 0.5938 
—0.3974, 0.7481 
—0.8595, 0.3782 
0.0014, 0.5084). 


Variable    $\hat\beta_j$    SE($\hat\beta_j$)   p-value
constant     4.1769   0.3790   0.0000
X1          -0.0146   0.0117   0.2105
X2           0.0014   0.0056   0.8051
X3          -0.0127   0.0223   0.5692
X4           0.1100   0.0366   0.0028
X5          -0.2831   0.1053   0.0074
X6           0.4211   0.1102   0.0001
X7           0.0064   0.0049   0.1885
X8          -0.1832   0.0368   0.0000
X9           0.0684   0.0225   0.0025
X10         -0.2018   0.0484   0.0000
X11         -0.0400   0.0081   0.0000
X12          0.0445   0.0115   0.0001
X13         -0.2626   0.0161   0.0000

Recall that the estimated residuals $Y - \mathcal{X}\hat\beta$ did not show a big departure from normality,
which means that the testing procedure developed above can be used.


1. First a global test of significance for the regression coefficients is performed, 


$H_0: (\beta_1, \ldots, \beta_{13})^{\top} = 0.$


This is obtained by defining $A = (0_{13}, \mathcal{I}_{13})$ and $a = 0_{13}$ so that $H_0$ is equivalent to
$A\beta = a$ where $\beta = (\beta_0, \beta_1, \ldots, \beta_{13})^{\top}$. Based on the observed values, $F = 123.20$. This
is highly significant ($F_{0.95;13,492} = 1.7401$), thus we reject $H_0$. Note that under $H_0$,
$\hat\beta_{H_0} = (3.0345, 0, \ldots, 0)^{\top}$ where $3.0345 = \bar{y}$.


2. Since we are interested in the effect that being located close to the river has on the
value of the houses, the second test is $H_0: \beta_4 = 0$. This is done by fixing


$A = (0,0,0,0,1,0,0,0,0,0,0,0,0,0)$


and $a = 0$ to obtain the equivalent hypothesis $H_0: A\beta = a$. The result is again
significant: $F = 9.0125$ ($F_{0.95;1,492} = 3.8604$) with a p-value of 0.0028. Note that this is
the same p-value obtained in the individual test $\beta_4 = 0$ in Chapter 3, computed using
a different setup.


3. A third test notices the fact that some of the regressors in the full model (3.57) appear 
to be insignificant (that is they have high individual p-values). It can be confirmed from 
a joint test if the corresponding reduced model, formulated by deleting the insignificant 
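variables, is rejected by the data. We want to test $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_7 = 0$. Hence,

$$A = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}$$

and $a = 0_4$. The test statistic is 0.9344, which is not significant for $F_{4,492}$. Given that
the p-value is equal to 0.44, we cannot reject the null hypothesis nor the corresponding
reduced model. The value of $\hat\beta$ under the null hypothesis is

$$\hat\beta_{H_0} = (4.16, 0, 0, 0, 0.11, -0.31, 0.47, 0, -0.19, 0.05, -0.20, -0.04, 0.05, -0.26)^{\top}.$$

A possible reduced model is

$$X_{14} = \beta_0 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \beta_8 X_8 + \cdots + \beta_{13} X_{13} + \varepsilon.$$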


Estimating this reduced model using OLS, as was done in Chapter 3, provides the 
results shown in Table 7.3. 


Note that the reduced model has $r^2 = 0.763$ which is very close to $r^2 = 0.765$ obtained
from the full model. Clearly, including variables $X_1$, $X_2$, $X_3$, and $X_7$ does not provide
valuable information in explaining the variation of $X_{14}$, the price of the houses.

Variable   $\hat\beta_j$    SE        t      p-value
const      4.1582   0.3628   11.462   0.0000
X4         0.1087   0.0362    2.999   0.0028
X5        -0.3055   0.0973   -3.140   0.0018
X6         0.4668   0.1059    4.407   0.0000
X8        -0.1855   0.0327   -5.679   0.0000
X9         0.0492   0.0183    2.690   0.0074
X10       -0.2096   0.0446   -4.705   0.0000
X11       -0.0410   0.0078   -5.280   0.0000
X12        0.0481   0.0112    4.306   0.0000
X13       -0.2588   0.0149  -17.396   0.0000

Table 7.3. Linear Regression for Boston Housing Data Set.
MVAlinreg2bh.xpl
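The three tests above all follow one pattern: estimate $\beta$ by OLS, estimate it again under $A\beta = a$, and compare the fits. The sketch below illustrates this on synthetic data of the same shape as the Boston housing regression (506 rows, a constant plus 13 regressors). The function name `linear_restriction_f` and the simulated coefficients are assumptions made for illustration only, and the statistic is written in the standard form for a linear hypothesis in the normal linear model.

```python
import numpy as np

def linear_restriction_f(X, y, A, a):
    """F-statistic for H0: A beta = a, plus the unrestricted and restricted OLS estimates."""
    n, p = X.shape
    q = A.shape[0]                                   # number of restrictions
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y                         # unrestricted OLS
    resid = y - X @ beta
    rss = resid @ resid
    gap = A @ beta - a                               # violation of the restrictions
    lam = np.linalg.solve(A @ xtx_inv @ A.T, gap)
    beta_h0 = beta - xtx_inv @ A.T @ lam             # OLS under A beta = a
    f_stat = (gap @ lam / q) / (rss / (n - p))       # F with (q, n - p) degrees of freedom
    return f_stat, beta, beta_h0

# Synthetic data: coefficients 1, 2, 3 and 7 are truly zero (illustrative values).
rng = np.random.default_rng(1)
n = 506
X = np.column_stack([np.ones(n), rng.normal(size=(n, 13))])
beta_true = np.zeros(14)
beta_true[[0, 4, 5, 6]] = [4.0, 0.1, -0.3, 0.5]
y = X @ beta_true + rng.normal(scale=0.2, size=n)

# Restriction matrix selecting beta_1, beta_2, beta_3, beta_7, as in the third test.
A = np.zeros((4, 14))
A[[0, 1, 2, 3], [1, 2, 3, 7]] = 1.0
a = np.zeros(4)

f_stat, beta_hat, beta_h0 = linear_restriction_f(X, y, A, a)
print(f_stat)
```

With $n = 506$ and $p = 14$ the denominator degrees of freedom are 492, as in the critical values quoted above. The restricted estimate `beta_h0` has exact zeros in the restricted positions, mirroring $\hat\beta_{H_0}$ in the text.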
7.4 Exercises 


EXERCISE 7.1 Use Theorem 7.1 to derive a test for testing the hypothesis that a dice is 
balanced, based on n tosses of that dice. (Hint: use the multinomial probability function.) 


EXERCISE 7.2 Consider $N_3(\mu, \Sigma)$. Formulate the hypothesis $H_0: \mu_1 = \mu_2 = \mu_3$ in terms
of $A\mu = a$.


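EXERCISE 7.3 Simulate a normal sample with $\mu = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 2 \end{pmatrix}$ and test $H_0:
2\mu_1 - \mu_2 = 0.2$ first with $\Sigma$ known and then with $\Sigma$ unknown. Compare the results.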


EXERCISE 7.4 Derive expression (7.3) for the likelihood ratio test statistic in Test Prob- 
lem 2. 


EXERCISE 7.5 With the simulated data set of Example 7.14, test the hypothesis of equality 
of the covariance matrices. 


EXERCISE 7.6 In the U.S. companies data set, test the equality of means between the energy 
and manufacturing sectors, taking the full vector of observations X, to Xs. Derive the 
simultaneous confidence intervals for the differences. 


EXERCISE 7.7 Let X ~ No(u,) where % is known to be 4 = 7 - ) We have 
an i.i.d. sample of sizen = 6 providing z' = (1 t). Solve the following test problems 


(a = 0.05): 




aio: p= (2, ie Ai: pF (2, 2) 
b) Ho: jutjo=§ Mi: wt+toF 
c) Ho: pa-po=5 Me fn—-faF 
d) Ho: [1 = 2 Ay: roan - 2 


For each case, represent the rejection region graphically (comment!). 


NIRNIN 4 


EXERCISE 7.8 Repeat the preceding exercise with $\Sigma$ unknown and S = ( 2 3 )


Compare the results. 


EXERCISE 7.9 Consider $X \sim N_3(\mu, \Sigma)$. An i.i.d. sample of size $n = 10$ provides:


a) Knowing that the eigenvalues of $S$ are integers, describe a 95% confidence region for
$\mu$. (Hint: to compute eigenvalues use $|S| = \prod_{j=1}^{3} \lambda_j$ and $\mathrm{tr}(S) = \sum_{j=1}^{3} \lambda_j$).
b) Calculate the simultaneous confidence intervals for $\mu_1$, $\mu_2$ and $\mu_3$.
c) Can we assert that $\mu_1$ is an average of $\mu_2$ and $\mu_3$?


EXERCISE 7.10 Consider two independent i.i.d. samples, each of size 10, from two bivari- 
ate normal populations. The results are summarized below: 


= 3,1) =U)" 


4 —-1 2 —2 
s=(1 3 )i%=(2 %) 
Provide a solution to the following tests: 
a) $H_0: \mu_1 = \mu_2$,  $H_1: \mu_1 \neq \mu_2$
b) $H_0: \mu_{11} = \mu_{21}$,  $H_1: \mu_{11} \neq \mu_{21}$
c) $H_0: \mu_{12} = \mu_{22}$,  $H_1: \mu_{12} \neq \mu_{22}$


Compare the solutions and comment. 


EXERCISE 7.11 Prove expression (7.4) in the Test Problem 2 with log-likelihoods $\ell_0^*$ and
$\ell_1^*$. (Hint: use (2.29)).

EXERCISE 7.12 Assume that $X \sim N_p(\mu, \Sigma)$ where $\Sigma$ is unknown.


a) Derive the log likelihood ratio test for testing the independence of the $p$ components,
that is $H_0$: $\Sigma$ is a diagonal matrix. (Solution: $-2\log\lambda = -n\log|R|$ where $R$ is the
correlation matrix, which is asymptotically a $\chi^2_{\frac{1}{2}p(p-1)}$ under $H_0$).

b) Assume that $\Sigma$ is a diagonal matrix (all the variables are independent). Can an asymptotic
test for $H_0: \mu = \mu_0$ against $H_1: \mu \neq \mu_0$ be derived? How would this compare to
$p$ independent univariate t-tests on each $\mu_j$?


c) Show an easy derivation of an asymptotic test for testing the equality of the $p$ means
(Hint: use $(C\bar{X})^{\top}(CSC^{\top})^{-1}C\bar{X} \to \chi^2_{p-1}$ where $S = \mathrm{diag}(s_{11}, \ldots, s_{pp})$ and $C$ is defined
as in (7.10)). Compare this to the simple ANOVA procedure used in Section 3.5.


EXERCISE 7.13 The yields of wheat have been measured in 30 parcels that have been randomly
attributed to 3 lots prepared by one of 3 different fertilizers A, B and C. The data
are


Fertilizer Yield 


So & RA AK WH 
DOOAAAA RRA 
®W ®W L TW WR HH WIO 


Using Exercise 7.12, 


a) test the independence between the 3 variables. 
b) test whether $\mu^{\top} = [2\ 6\ 4]$ and compare this to the 3 univariate t-tests.


c) test whether $\mu_1 = \mu_2 = \mu_3$ using simple ANOVA and the $\chi^2$ approximation.

EXERCISE 7.14 Consider an i.i.d. sample of size $n = 5$ from a bivariate normal distribution


rm lo(2 4) 


where $\rho$ is a known parameter. Suppose $\bar{x}^{\top} = (1\ 0)$. For what value of $\rho$ would the hypothesis
$H_0: \mu^{\top} = (0\ 0)$ be rejected in favor of $H_1: \mu^{\top} \neq (0\ 0)$ (at the 5% level)?

EXERCISE 7.15 Using Example 7.14, test the last two cases described there and test the
sample number one ($n_1 = 30$), to see if they are from a normal population with $\Sigma = 4\mathcal{I}_4$ (the
sample covariance matrix to be used is given by $S_1$).


EXERCISE 7.16 Consider the bank data set. For the counterfeit bank notes, we want to
know if the length of the diagonal ($X_6$) can be predicted by a linear model in $X_1$ to $X_5$.
Estimate the linear model and test if the coefficients are significantly different from zero.


EXERCISE 7.17 In Example 7.10, can you predict the vocabulary score of the children in 
eleventh grade, by knowing the results from grades 8-9 and 10? Estimate a linear model and 
test its significance. 


EXERCISE 7.18 Test the equality of the covariance matrices from the two groups in the 
WAIS subtest (Example 7.19). 


EXERCISE 7.19 Prove expressions (7.21), (7.22) and (7.28). 


EXERCISE 7.20 Using Theorem 6.3 and expression (7.16), construct an asymptotic rejection
region of size $\alpha$ for testing, in a general model $f(x, \theta)$, with $\theta \in \mathbb{R}^k$,


$H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$.


EXERCISE 7.21 Exercise 6.5 considered the pdf f (21,22) = wipe (mm ats) 
$x_1, x_2 > 0$. Solve the problem of testing $H_0: \theta^{\top} = (\theta_{01}, \theta_{02})$ from an i.i.d. sample of size $n$ on
$x = (x_1, x_2)^{\top}$, where $n$ is large.


EXERCISE 7.22 In Olkin and Veath (1980), the evolution of citrate concentrations in plasma
is observed at 3 different times of day, $X_1$ (8 am), $X_2$ (11 am) and $X_3$ (3 pm), for two groups
of patients who follow different diets. (The patients were randomly attributed to each group
under a balanced design, $n_1 = n_2 = 5$).

The data are: 


Group X,(8 am) X2(11 am) X3(3 pm) 


125 137 121 
Us FU 17 
I 105 119 125 
151 149 128 
137 139 109 
93 121 107 
116 135 106 
II 109 83 100 
89 95 83 


116 128 100

Test if the profiles of the groups are parallel, if they are at the same level and if they are
horizontal.


Part III 


Multivariate Techniques 


8 Decomposition of Data Matrices by Factors


In Chapter 1 basic descriptive techniques were developed which provided tools for “looking” at
multivariate data. They were based on adaptations of bivariate or univariate devices used to 
reduce the dimensions of the observations. In the following three chapters, issues of reducing 
the dimension of a multivariate data set will be discussed. The perspectives will be different 
but the tools will be related. 


In this chapter, we take a descriptive perspective and show how using a geometrical approach 
a “best” way of reducing the dimension of a data matrix can be derived with respect to a 
least-squares criterion. The result will be low dimensional graphical pictures of the data 
matrix. This involves the decomposition of the data matrix into “factors”. These “factors” 
will be sorted in decreasing order of importance. The approach is very general and is the 
core idea of many multivariate techniques. We deliberately use the word “factor” here as a 
tool or transformation for structural interpretation in an exploratory analysis. In practice, 
the matrix to be decomposed will be some transformation of the original data matrix and as 
shown in the following chapters, these transformations provide easier interpretations of the 
obtained graphs in lower dimensional spaces. 


Chapter 9 addresses the issue of reducing the dimensionality of a multivariate random vari- 
able by using linear combinations (the principal components). The identified principal com- 
ponents are ordered in decreasing order of importance. When applied in practice to a data 
matrix, the principal components will turn out to be the factors of a transformed data matrix
(the data will be centered and possibly standardized).


Factor analysis is discussed in Chapter 10. The same problem of reducing the dimension of a 
multivariate random variable is addressed but in this case the number of factors is fixed from 
the start. Each factor is interpreted as a latent characteristic of the individuals revealed by 
the original variables. The non-uniqueness of the solutions is dealt with by searching for the 
representation with the easiest interpretation for the analysis. 


Summarizing, this chapter can be seen as a foundation since it develops a basic tool for
reducing the dimension of a multivariate data matrix.


8.1 The Geometric Point of View


As a matter of introducing certain ideas, assume that the data matrix $\mathcal{X}(n \times p)$ is composed
of $n$ observations (or individuals) of $p$ variables.


There are in fact two ways of looking at $\mathcal{X}$, row by row or column by column:


(1) Each row (observation) is a vector $x_i^{\top} = (x_{i1}, \ldots, x_{ip}) \in \mathbb{R}^p$.


From this point of view our data matrix $\mathcal{X}$ is representable as a cloud of $n$ points in
$\mathbb{R}^p$ as shown in Figure 8.1.


Figure 8.1. Cloud of $n$ points of coordinates $x_i$.


(2) Each column (variable) is a vector $x_{[j]} = (x_{1j}, \ldots, x_{nj})^{\top} \in \mathbb{R}^n$.


From this point of view the data matrix $\mathcal{X}$ is a cloud of $p$ points in $\mathbb{R}^n$ as shown in
Figure 8.2.


Figure 8.2. Cloud of $p$ points of coordinates $x_{[j]}$.


When $n$ and/or $p$ are large (larger than 2 or 3), we cannot produce interpretable graphs of
these clouds of points. Therefore, the aim of the factorial methods to be developed here is
two-fold. We shall try to simultaneously approximate the column space $C(\mathcal{X})$ and the row
space $C(\mathcal{X}^{\top})$ with smaller subspaces. The hope is of course that this can be done without
losing too much information about the variation and structure of the point clouds in both
spaces. Ideally, this will provide insights into the structure of $\mathcal{X}$ through graphs in $\mathbb{R}$, $\mathbb{R}^2$ or
$\mathbb{R}^3$. The main focus then is to find the dimension reducing factors.


Summary


↪ Each row (individual) of $\mathcal{X}$ is a $p$-dimensional vector. From this point of
view $\mathcal{X}$ can be considered as a cloud of $n$ points in $\mathbb{R}^p$.


↪ Each column (variable) of $\mathcal{X}$ is an $n$-dimensional vector. From this point
of view $\mathcal{X}$ can be considered as a cloud of $p$ points in $\mathbb{R}^n$.


8.2 Fitting the p-dimensional Point Cloud 


Subspaces of Dimension 1 


In this section $\mathcal{X}$ is represented by a cloud of $n$ points in $\mathbb{R}^p$ (considering each row). The
question is how to project this point cloud onto a space of lower dimension. To begin consider
the simplest problem, namely finding a subspace of dimension 1. The problem boils down
to finding a straight line $F_1$ through the origin. The direction of this line can be defined by
a unit vector $u_1 \in \mathbb{R}^p$. Hence, we are searching for the vector $u_1$ which gives the “best” fit
of the initial cloud of $n$ points. The situation is depicted in Figure 8.3.


The representation of the $i$-th individual $x_i \in \mathbb{R}^p$ on this line is obtained by the projection
of the corresponding point onto $u_1$, i.e., the projection point $p_{x_i}$. We know from (2.42) that
the coordinate of $x_i$ on $F_1$ is given by

$$p_{x_i} = x_i^{\top} \frac{u_1}{\|u_1\|} = x_i^{\top} u_1. \qquad (8.1)$$


We define the best line $F_1$ in the following “least-squares” sense: find $u_1 \in \mathbb{R}^p$ which
minimizes

$$\sum_{i=1}^{n} \|x_i - p_{x_i}\|^2. \qquad (8.2)$$

Since $\|x_i - p_{x_i}\|^2 = \|x_i\|^2 - \|p_{x_i}\|^2$ by Pythagoras's theorem, the problem of minimizing (8.2)
is equivalent to maximizing $\sum_{i=1}^{n} \|p_{x_i}\|^2$. Thus the problem is to find $u_1 \in \mathbb{R}^p$ that maximizes
$\sum_{i=1}^{n} \|p_{x_i}\|^2$ under the constraint $\|u_1\| = 1$.


Figure 8.3.


With (8.1) we can write

$$\begin{pmatrix} p_{x_1} \\ p_{x_2} \\ \vdots \\ p_{x_n} \end{pmatrix} = \begin{pmatrix} x_1^{\top} u_1 \\ x_2^{\top} u_1 \\ \vdots \\ x_n^{\top} u_1 \end{pmatrix} = \mathcal{X} u_1$$

and the problem can finally be reformulated as: find $u_1 \in \mathbb{R}^p$ with $\|u_1\| = 1$ that maximizes
the quadratic form $(\mathcal{X} u_1)^{\top} (\mathcal{X} u_1)$ or

$$\max_{u_1^{\top} u_1 = 1} u_1^{\top} (\mathcal{X}^{\top} \mathcal{X}) u_1. \qquad (8.3)$$

The solution is given by Theorem 2.5 (using $A = \mathcal{X}^{\top}\mathcal{X}$ and $B = \mathcal{I}$ in the theorem).


THEOREM 8.1 The vector $u_1$ which minimizes (8.2) is the eigenvector of $\mathcal{X}^{\top}\mathcal{X}$ associated
with the largest eigenvalue $\lambda_1$ of $\mathcal{X}^{\top}\mathcal{X}$.

Note that if the data have been centered, i.e., $\bar{x} = 0$, then $\mathcal{X} = \mathcal{X}_c$, where $\mathcal{X}_c$ is the
centered data matrix, and $\frac{1}{n}\mathcal{X}_c^{\top}\mathcal{X}_c$ is the covariance matrix. Thus Theorem 8.1 says that
we are searching for a maximum of the quadratic form (8.3) w.r.t. the covariance matrix
$S_{\mathcal{X}} = n^{-1} \mathcal{X}_c^{\top} \mathcal{X}_c$.

Representation of the Cloud on $F_1$


The coordinates of the $n$ individuals on $F_1$ are given by $\mathcal{X} u_1$. $\mathcal{X} u_1$ is called the first factorial
variable or the first factor and $u_1$ the first factorial axis. The $n$ individuals, $x_i$, are now
represented by a new factorial variable $z_1 = \mathcal{X} u_1$. This factorial variable is a linear combination
of the original variables $(x_{[1]}, \ldots, x_{[p]})$ whose coefficients are given by the vector $u_1$, i.e.,

$$z_1 = u_{11} x_{[1]} + \ldots + u_{p1} x_{[p]}. \qquad (8.4)$$


Subspaces of Dimension 2 


If we approximate the $n$ individuals by a plane (dimension 2), it can be shown via Theorem 2.5
that this space contains $u_1$. The plane is determined by the best linear fit ($u_1$) and
a unit vector $u_2$ orthogonal to $u_1$ which maximizes the quadratic form $u_2^{\top} (\mathcal{X}^{\top}\mathcal{X}) u_2$ under
the constraints

$$\|u_2\| = 1, \quad \text{and} \quad u_1^{\top} u_2 = 0.$$


THEOREM 8.2 The second factorial axis, $u_2$, is the eigenvector of $\mathcal{X}^{\top}\mathcal{X}$ corresponding to
the second largest eigenvalue $\lambda_2$ of $\mathcal{X}^{\top}\mathcal{X}$.


The unit vector $u_2$ characterizes a second line, $F_2$, on which the points are projected. The
coordinates of the $n$ individuals on $F_2$ are given by $z_2 = \mathcal{X} u_2$. The variable $z_2$ is called
the second factorial variable or the second factor. The representation of the $n$ individuals in
two-dimensional space ($z_1 = \mathcal{X} u_1$ vs. $z_2 = \mathcal{X} u_2$) is shown in Figure 8.4.


Subspaces of Dimension $q$ ($q \le p$)


In the case of $q$ dimensions the task is again to minimize (8.2) but with projection points in a
$q$-dimensional subspace. Following the same argument as above, it can be shown via Theorem
2.5 that this best subspace is generated by $u_1, u_2, \ldots, u_q$, the orthonormal eigenvectors of
$\mathcal{X}^{\top}\mathcal{X}$ associated with the corresponding eigenvalues $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_q$. The coordinates
of the $n$ individuals on the $k$-th factorial axis, $u_k$, are given by the $k$-th factorial variable
$z_k = \mathcal{X} u_k$ for $k = 1, \ldots, q$. Each factorial variable $z_k = (z_{1k}, z_{2k}, \ldots, z_{nk})^{\top}$ is a linear
combination of the original variables $x_{[1]}, x_{[2]}, \ldots, x_{[p]}$ whose coefficients are given by the
elements of the $k$-th vector $u_k$: $z_{ik} = \sum_{m=1}^{p} x_{im} u_{mk}$.


Figure 8.4. Representation of the individuals $x_1, \ldots, x_n$ as a two-dimensional
point cloud (axes $z_1$ and $z_2$).
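The construction of the factorial variables $z_k = \mathcal{X} u_k$ for a $q$-dimensional subspace can be sketched in a few lines. The function name `factorial_variables` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def factorial_variables(X, q):
    """Return Z = X U_q, the q leading eigenvectors U_q of X'X, and their eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)     # ascending order
    order = np.argsort(eigvals)[::-1][:q]          # indices of the q largest eigenvalues
    U = eigvecs[:, order]                          # factorial axes u_1, ..., u_q (columns)
    return X @ U, U, eigvals[order]

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))                       # synthetic 30 x 5 data matrix
Z, U, lam = factorial_variables(X, q=2)
print(Z.shape)                                     # coordinates of the 30 individuals in the plane
```

The columns of `Z` are the coordinates to plot (as in Figure 8.4). They are mutually orthogonal, and their squared norms are the eigenvalues, since $Z^{\top} Z = U^{\top} \mathcal{X}^{\top}\mathcal{X} U = \mathrm{diag}(\lambda_1, \ldots, \lambda_q)$.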


Summary


↪ The $p$-dimensional point cloud of individuals can be graphically represented
by projecting each element into spaces of smaller dimensions.


↪ The first factorial axis is $u_1$ and defines a line $F_1$ through the origin. This
line is found by minimizing the orthogonal distances (8.2). The factor
$u_1$ equals the eigenvector of $\mathcal{X}^{\top}\mathcal{X}$ corresponding to its largest eigenvalue.
The coordinates for representing the point cloud on a straight line are
given by $z_1 = \mathcal{X} u_1$.


↪ The second factorial axis is $u_2$, where $u_2$ denotes the eigenvector of $\mathcal{X}^{\top}\mathcal{X}$
corresponding to its second largest eigenvalue. The coordinates for representing
the point cloud on a plane are given by $z_1 = \mathcal{X} u_1$ and $z_2 = \mathcal{X} u_2$.


↪ The factor directions $1, \ldots, q$ are $u_1, \ldots, u_q$, which denote the eigenvectors
of $\mathcal{X}^{\top}\mathcal{X}$ corresponding to the $q$ largest eigenvalues. The coordinates for
representing the point cloud of individuals on a $q$-dimensional subspace
are given by $z_1 = \mathcal{X} u_1, \ldots, z_q = \mathcal{X} u_q$.


8.3 Fitting the n-dimensional Point Cloud


Subspaces of Dimension 1 


Suppose that $\mathcal{X}$ is represented by a cloud of $p$ points (variables) in $\mathbb{R}^n$ (considering each
column). How can this cloud be projected into a lower dimensional space? We start as
before with one dimension. In other words, we have to find a straight line $G_1$, which is
defined by the unit vector $v_1 \in \mathbb{R}^n$, and which gives the best fit of the initial cloud of $p$
points.


Algebraically, this is the same problem as above (replace $\mathcal{X}$ by $\mathcal{X}^{\top}$ and follow Section
8.2): the representation of the $j$-th variable $x_{[j]} \in \mathbb{R}^n$ is obtained by the projection of
the corresponding point onto the straight line $G_1$ or the direction $v_1$. Hence we have to find
$v_1$ such that $\sum_{j=1}^{p} \|p_{x_{[j]}}\|^2$ is maximized, or equivalently, we have to find the unit vector $v_1$
which maximizes $(\mathcal{X}^{\top} v_1)^{\top} (\mathcal{X}^{\top} v_1) = v_1^{\top} (\mathcal{X}\mathcal{X}^{\top}) v_1$. The solution is given by Theorem 2.5.


THEOREM 8.3 $v_1$ is the eigenvector of $\mathcal{X}\mathcal{X}^{\top}$ corresponding to the largest eigenvalue $\mu_1$ of
$\mathcal{X}\mathcal{X}^{\top}$.


Representation of the Cloud on $G_1$


The coordinates of the $p$ variables on $G_1$ are given by $w_1 = \mathcal{X}^{\top} v_1$, the first factorial axis. The
$p$ variables are now represented by a linear combination of the original individuals $x_1, \ldots, x_n$,
whose coefficients are given by the vector $v_1$, i.e., for $j = 1, \ldots, p$

$$w_{1j} = v_{11} x_{1j} + \ldots + v_{1n} x_{nj}. \qquad (8.5)$$


Subspaces of Dimension $q$ ($q \le n$)


The representation of the $p$ variables in a subspace of dimension $q$ is done in the same
manner as for the $n$ individuals above. The best subspace is generated by the orthonormal
eigenvectors $v_1, v_2, \ldots, v_q$ of $\mathcal{X}\mathcal{X}^{\top}$ associated with the eigenvalues $\mu_1 \ge \mu_2 \ge \ldots \ge \mu_q$. The
coordinates of the $p$ variables on the $k$-th factorial axis are given by the factorial variables
$w_k = \mathcal{X}^{\top} v_k$, $k = 1, \ldots, q$. Each factorial variable $w_k = (w_{k1}, w_{k2}, \ldots, w_{kp})^{\top}$ is a linear
combination of the original individuals $x_1, x_2, \ldots, x_n$ whose coefficients are given by the
elements of the $k$-th vector $v_k$: $w_{kj} = \sum_{m=1}^{n} v_{km} x_{mj}$. The representation in a subspace of
dimension $q = 2$ is depicted in Figure 8.5.


Figure 8.5. Representation of the variables $x_{[1]}, \ldots, x_{[p]}$ as a two-dimensional
point cloud.
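The same computation applies to the cloud of variables, with $\mathcal{X}\mathcal{X}^{\top}$ in place of $\mathcal{X}^{\top}\mathcal{X}$. The sketch below uses a synthetic $12 \times 3$ matrix (an illustrative assumption) and computes the factorial variables $w_k = \mathcal{X}^{\top} v_k$ for $q = 2$.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(12, 3))                 # n = 12 individuals, p = 3 variables

# Eigen-decomposition of the n x n matrix X X'.
mu_all, V_all = np.linalg.eigh(X @ X.T)      # ascending order
order = np.argsort(mu_all)[::-1][:2]         # two largest eigenvalues
mu = mu_all[order]
V = V_all[:, order]                          # v_1, v_2 as columns (each in R^n)

# Coordinates of the p variables on the first two factorial axes.
W = X.T @ V                                  # p x 2 matrix with columns w_1, w_2
print(W.shape)
```

Each row of `W` is one variable represented in the plane (as in Figure 8.5). Note that this route requires the eigenvectors of an $n \times n$ matrix, which is the costlier of the two decompositions when $n > p$.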


Summary


↪ The $n$-dimensional point cloud of variables can be graphically represented
by projecting each element into spaces of smaller dimensions.


↪ The first factor direction is $v_1$ and defines a line $G_1$ through the origin.
The vector $v_1$ equals the eigenvector of $\mathcal{X}\mathcal{X}^{\top}$ corresponding to the largest
eigenvalue of $\mathcal{X}\mathcal{X}^{\top}$. The coordinates for representing the point cloud on
a straight line are $w_1 = \mathcal{X}^{\top} v_1$.


↪ The second factor direction is $v_2$, where $v_2$ denotes the eigenvector of
$\mathcal{X}\mathcal{X}^{\top}$ corresponding to its second largest eigenvalue. The coordinates for
representing the point cloud on a plane are given by $w_1 = \mathcal{X}^{\top} v_1$ and
$w_2 = \mathcal{X}^{\top} v_2$.


↪ The factor directions $1, \ldots, q$ are $v_1, \ldots, v_q$, which denote the eigenvectors
of $\mathcal{X}\mathcal{X}^{\top}$ corresponding to the $q$ largest eigenvalues. The coordinates for
representing the point cloud of variables on a $q$-dimensional subspace are
given by $w_1 = \mathcal{X}^{\top} v_1, \ldots, w_q = \mathcal{X}^{\top} v_q$.


8.4 Relations between Subspaces


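The aim of this section is to present a duality relationship between the two approaches shown
in Sections 8.2 and 8.3. Consider the eigenvector equations in $\mathbb{R}^n$

$$(\mathcal{X}\mathcal{X}^{\top}) v_k = \mu_k v_k \qquad (8.6)$$

for $k \le r$, where $r = \mathrm{rank}(\mathcal{X}\mathcal{X}^{\top}) = \mathrm{rank}(\mathcal{X}) \le \min(p, n)$. Multiplying by $\mathcal{X}^{\top}$, we have

$$\mathcal{X}^{\top}(\mathcal{X}\mathcal{X}^{\top}) v_k = \mu_k \mathcal{X}^{\top} v_k \qquad (8.7)$$

or

$$(\mathcal{X}^{\top}\mathcal{X})(\mathcal{X}^{\top} v_k) = \mu_k (\mathcal{X}^{\top} v_k) \qquad (8.8)$$

so that each eigenvector $v_k$ of $\mathcal{X}\mathcal{X}^{\top}$ corresponds to an eigenvector $(\mathcal{X}^{\top} v_k)$ of $\mathcal{X}^{\top}\mathcal{X}$ associated
with the same eigenvalue $\mu_k$. This means that every non-zero eigenvalue of $\mathcal{X}\mathcal{X}^{\top}$ is an
eigenvalue of $\mathcal{X}^{\top}\mathcal{X}$. The corresponding eigenvectors are related by

$$u_k = c_k \mathcal{X}^{\top} v_k,$$

where $c_k$ is some constant.


Now consider the eigenvector equations in $\mathbb{R}^p$:

$$(\mathcal{X}^{\top}\mathcal{X}) u_k = \lambda_k u_k \qquad (8.9)$$

for $k \le r$. Multiplying by $\mathcal{X}$, we have

$$(\mathcal{X}\mathcal{X}^{\top})(\mathcal{X} u_k) = \lambda_k (\mathcal{X} u_k), \qquad (8.10)$$

i.e., each eigenvector $u_k$ of $\mathcal{X}^{\top}\mathcal{X}$ corresponds to an eigenvector $\mathcal{X} u_k$ of $\mathcal{X}\mathcal{X}^{\top}$ associated with
the same eigenvalue $\lambda_k$. Therefore, every non-zero eigenvalue of $(\mathcal{X}^{\top}\mathcal{X})$ is an eigenvalue of
$\mathcal{X}\mathcal{X}^{\top}$. The corresponding eigenvectors are related by

$$v_k = d_k \mathcal{X} u_k,$$

where $d_k$ is some constant. Now, since $u_k^{\top} u_k = v_k^{\top} v_k = 1$, we have
$1 = u_k^{\top} u_k = c_k^2\, v_k^{\top} \mathcal{X}\mathcal{X}^{\top} v_k = c_k^2 \lambda_k$ (and likewise for $d_k$), so that $c_k = d_k = \frac{1}{\sqrt{\lambda_k}}$. This leads
to the following result: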


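THEOREM 8.4 (Duality Relations) Let $r$ be the rank of $\mathcal{X}$. For $k \le r$, the eigenvalues $\lambda_k$
of $\mathcal{X}^{\top}\mathcal{X}$ and $\mathcal{X}\mathcal{X}^{\top}$ are the same and the eigenvectors ($u_k$ and $v_k$, respectively) are related by

$$u_k = \frac{1}{\sqrt{\lambda_k}} \mathcal{X}^{\top} v_k, \qquad (8.11)$$

$$v_k = \frac{1}{\sqrt{\lambda_k}} \mathcal{X} u_k. \qquad (8.12)$$

Note that the projection of the $p$ variables on the factorial axis $v_k$ is given by

$$w_k = \mathcal{X}^{\top} v_k = \frac{1}{\sqrt{\lambda_k}} \mathcal{X}^{\top}\mathcal{X} u_k = \sqrt{\lambda_k}\, u_k. \qquad (8.13)$$

Therefore, the eigenvectors $v_k$ do not have to be explicitly recomputed to get $w_k$.


Note that $u_k$ and $v_k$ provide the SVD of $\mathcal{X}$ (see Theorem 2.2). Letting
$U = [u_1\ u_2\ \ldots\ u_r]$, $V = [v_1\ v_2\ \ldots\ v_r]$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_r)$ we have

$$\mathcal{X} = V \Lambda^{1/2} U^{\top}$$

so that

$$x_{ij} = \sum_{k=1}^{r} \lambda_k^{1/2} v_{ik} u_{jk}. \qquad (8.14)$$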


In the following section this method is applied in analysing consumption behavior across 
different household types. 
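The duality relations (8.11) to (8.14) are easy to verify numerically. The sketch below works on a synthetic full-rank $12 \times 4$ matrix (an assumption for illustration), so that $r = p = 4$; only the eigen-decomposition of the smaller matrix $\mathcal{X}^{\top}\mathcal{X}$ is computed, and everything else follows from the relations.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(12, 4))                     # n = 12, p = 4, rank 4 with probability one

# Eigen-decomposition of the p x p matrix X'X, sorted in decreasing order.
lam, U = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

V = X @ U / np.sqrt(lam)                         # (8.12): v_k = X u_k / sqrt(lambda_k)
U_back = X.T @ V / np.sqrt(lam)                  # (8.11): recovers u_k from v_k
W = U * np.sqrt(lam)                             # (8.13): w_k = sqrt(lambda_k) u_k
X_rebuilt = V @ np.diag(np.sqrt(lam)) @ U.T      # (8.14): X = V Lambda^{1/2} U'

print(np.allclose(X @ X.T @ V, V * lam))         # columns of V are eigenvectors of X X'
print(np.allclose(W, X.T @ V))                   # w_k obtained without decomposing X X'
print(np.allclose(X_rebuilt, X))                 # the SVD reproduces the data matrix
```

Dividing by `np.sqrt(lam)` scales column $k$ by $1/\sqrt{\lambda_k}$ through broadcasting. This only works for non-zero eigenvalues, which is why the relations are stated for $k \le r$.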


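Summary


↪ The matrices $\mathcal{X}^{\top}\mathcal{X}$ and $\mathcal{X}\mathcal{X}^{\top}$ have the same non-zero eigenvalues
$\lambda_1, \ldots, \lambda_r$, where $r = \mathrm{rank}(\mathcal{X})$.


↪ The eigenvectors of $\mathcal{X}^{\top}\mathcal{X}$ can be calculated from the eigenvectors of $\mathcal{X}\mathcal{X}^{\top}$
and vice versa:

$$u_k = \frac{1}{\sqrt{\lambda_k}} \mathcal{X}^{\top} v_k \quad \text{and} \quad v_k = \frac{1}{\sqrt{\lambda_k}} \mathcal{X} u_k.$$


↪ The coordinates representing the variables (columns) of $\mathcal{X}$ in a $q$-dimensional
subspace can be easily calculated by $w_k = \sqrt{\lambda_k}\, u_k$.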


8.5 Practical Computation 


The practical implementation of the techniques introduced begins with the computation of
the eigenvalues $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p$ and the corresponding eigenvectors $u_1, \ldots, u_p$ of $\mathcal{X}^{\top}\mathcal{X}$.
(Since $p$ is usually less than $n$, this is numerically less involved than computing $v_k$ directly
for $k = 1, \ldots, p$). The representation of the $n$ individuals on a plane is then obtained by
plotting $z_1 = \mathcal{X} u_1$ versus $z_2 = \mathcal{X} u_2$ ($z_3 = \mathcal{X} u_3$ may also be added if a third dimension
is helpful). Using the Duality Relation (8.13) representations for the $p$ variables can easily
be obtained. These representations can be visualized in a scatterplot of $w_1 = \sqrt{\lambda_1}\, u_1$ against
$w_2 = \sqrt{\lambda_2}\, u_2$ (and possibly against $w_3 = \sqrt{\lambda_3}\, u_3$). Higher dimensional factorial resolutions
can be obtained (by computing $z_k$ and $w_k$ for $k > 3$) but, of course, cannot be plotted.


A standard way of evaluating the quality of the factorial representations in a subspace of
dimension $q$ is given by the ratio

$$\tau_q = \frac{\lambda_1 + \lambda_2 + \ldots + \lambda_q}{\lambda_1 + \lambda_2 + \ldots + \lambda_p}, \qquad (8.15)$$

where $0 \le \tau_q \le 1$. In general, the scalar product $y^{\top} y$ is called the inertia of $y \in \mathbb{R}^n$ w.r.t.
the origin. Therefore, the ratio $\tau_q$ is usually interpreted as the percentage of the inertia
explained by the first $q$ factors. Note that $\lambda_j = (\mathcal{X} u_j)^{\top} (\mathcal{X} u_j) = z_j^{\top} z_j$. Thus, $\lambda_j$ is the inertia
of the $j$-th factorial variable w.r.t. the origin. The denominator in (8.15) is a measure of the
total inertia of the $p$ variables, $x_{[j]}$. Indeed, by (2.3)

$$\sum_{j=1}^{p} \lambda_j = \mathrm{tr}(\mathcal{X}^{\top}\mathcal{X}) = \sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2 = \sum_{j=1}^{p} x_{[j]}^{\top} x_{[j]}.$$


REMARK 8.1 It is clear that the sum $\sum_{j=1}^{q} \lambda_j$ is the sum of the inertia of the first $q$ factorial
variables $z_1, z_2, \ldots, z_q$.


EXAMPLE 8.1 We consider the data set in Table B.6 which gives the food expenditures 
of various French families (manual workers = MA, employees = EM, managers = CA) 
with varying numbers of children (2, 3, 4 or 5 children). We are interested in investigating 
whether certain household types prefer certain food types. We can answer this question using 
the factorial approximations developed here. 


The correlation matrix corresponding to the data is

$$R = \begin{pmatrix}
1.00 & 0.59 & 0.20 & 0.32 & 0.25 & 0.86 & 0.30 \\
0.59 & 1.00 & 0.86 & 0.88 & 0.83 & 0.66 & -0.36 \\
0.20 & 0.86 & 1.00 & 0.96 & 0.93 & 0.33 & -0.49 \\
0.32 & 0.88 & 0.96 & 1.00 & 0.98 & 0.37 & -0.44 \\
0.25 & 0.83 & 0.93 & 0.98 & 1.00 & 0.23 & -0.40 \\
0.86 & 0.66 & 0.33 & 0.37 & 0.23 & 1.00 & 0.01 \\
0.30 & -0.36 & -0.49 & -0.44 & -0.40 & 0.01 & 1.00
\end{pmatrix}$$


We observe a rather high correlation between meat and poultry, whereas the correlation between
the expenditures for milk and wine is rather small. Are there household types that prefer, say, meat over bread?


We shall now represent food expenditures and households simultaneously using two factors.
First, note that in this particular problem the origin has no specific meaning (it represents
a “zero” consumer). So it makes sense to compare the consumption of any family to that
of an “average family” rather than to the origin. Therefore, the data is first centered (the
origin is translated to the center of gravity, $\bar{x}$). Furthermore, since the dispersions of the 7
variables are quite different each variable is standardized so that each has the same weight
in the analysis (mean 0 and variance 1). Finally, for convenience, we divide each element
in the matrix by $\sqrt{n} = \sqrt{12}$. (This will only change the scaling of the plots in the graphical
representation.)


The data matrix to be analyzed is

$$\mathcal{X}_* = \frac{1}{\sqrt{n}} H \mathcal{X} D^{-1/2},$$

where $H$ is the centering matrix and $D = \mathrm{diag}(s_{X_i X_i})$ (see Section 3.3). Note that by standardizing
by $\sqrt{n}$, it follows that $\mathcal{X}_*^{\top} \mathcal{X}_* = R$ where $R$ is the correlation matrix of the original
data. Calculating

$$\lambda = (4.33, 1.83, 0.63, 0.13, 0.06, 0.02, 0.00)^{\top}$$

shows that the directions of the first two eigenvectors play a dominant role ($\tau_2 = 88\%$),
whereas the other directions contribute less than 15% of inertia. A two-dimensional plot
should suffice for interpreting this data set.
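The ratio (8.15) can be evaluated directly from the eigenvalues reported above. A minimal sketch (the function name `inertia_ratio` is hypothetical):

```python
import numpy as np

def inertia_ratio(eigvals):
    """Return tau_1, ..., tau_p from (8.15) for a vector of eigenvalues."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # decreasing order
    return np.cumsum(lam) / lam.sum()

# Rounded eigenvalues of the French food example, as printed in the text.
lam = [4.33, 1.83, 0.63, 0.13, 0.06, 0.02, 0.00]
tau = inertia_ratio(lam)
print(np.round(tau, 2))
```

The second entry of `tau` is $(4.33 + 1.83)/7.00 = 0.88$, which reproduces the value $\tau_2 = 88\%$ quoted in the example.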


The coordinates of the projected data points are given in the two lower windows of Figure 8.6.
Let us first examine the food expenditure window. In this window we see the representation
of the $p = 7$ variables given by the first two factors. The plot shows the factorial variables
$w_1$ and $w_2$ in the same fashion as Figure 8.4. We see that the points for meat, poultry,
vegetables and fruits are close to each other in the lower left of the graph. The expenditures
for bread and milk can be found in the upper left whereas wine stands alone in the upper
right. The first factor, $w_1$, may be interpreted as the meat/fruit factor of consumption, the
second factor, $w_2$, as the bread/wine component.


In the lower window on the right-hand side, we show the factorial variables $z_1$ and $z_2$ from
the fit of the $n = 12$ household types. Note that by the Duality Relations of Theorem 8.4,
the factorial variables $z_j$ are linear combinations of the factors $w_k$ from the left window.
The points displayed in the consumer window (graph on the right) are plotted relative to
an average consumer represented by the origin. The manager families are located in the
lower left corner of the graph whereas the manual workers and employees tend to be in the
upper right. The factorial variables for CA5 (managers with five children) lie close to the
meat/fruit factor. Relative to the average consumer this household type is a large consumer of
meat/poultry and fruits/vegetables. In Chapter 9, we will return to these plots interpreting
them in a much deeper way. At this stage, it suffices to notice that the plots provide a
graphical representation in $\mathbb{R}^2$ of the information contained in the original, high-dimensional
($12 \times 7$) data matrix.
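The preprocessing and decomposition steps of this example can be sketched as follows. Since Table B.6 is not reproduced here, the code uses a synthetic $12 \times 7$ matrix as a stand-in; the numbers it produces are therefore not those of the French food data, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 12, 7
# Synthetic positive "expenditures" standing in for Table B.6 (not the real data).
X = rng.gamma(shape=5.0, scale=100.0, size=(n, p))

H = np.eye(n) - np.ones((n, n)) / n              # centering matrix
s = X.var(axis=0)                                # variances with divisor n
X_star = H @ X @ np.diag(1.0 / np.sqrt(s)) / np.sqrt(n)

R = np.corrcoef(X, rowvar=False)                 # correlation matrix of the original data
print(np.allclose(X_star.T @ X_star, R))         # X_*' X_* = R

# Factorial decomposition of the standardized matrix.
lam, U = np.linalg.eigh(X_star.T @ X_star)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

Z = X_star @ U[:, :2]                            # households: z_1, z_2 (12 x 2)
W = U[:, :2] * np.sqrt(lam[:2])                  # variables: w_1, w_2 (7 x 2), via (8.13)
tau2 = lam[:2].sum() / lam.sum()                 # quality of the two-dimensional fit
print(Z.shape, W.shape, tau2)
```

Plotting the rows of `Z` and the rows of `W` in two scatterplots gives the analogue of the two windows described above. Because the analyzed matrix reproduces $R$, the eigenvalues sum to $p = 7$.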


Figure 8.6. Representation of food expenditures and family types in two
dimensions (panels: food, families). MVAdecofood.xpl


Summary


↪ The practical implementation of factor decomposition of matrices consists
of computing the eigenvalues $\lambda_1, \ldots, \lambda_p$ and the eigenvectors $u_1, \ldots, u_p$ of
$\mathcal{X}^{\top}\mathcal{X}$. The representation of the $n$ individuals is obtained by plotting $z_1 =
\mathcal{X} u_1$ vs. $z_2 = \mathcal{X} u_2$ (and, if necessary, vs. $z_3 = \mathcal{X} u_3$). The representation
of the $p$ variables is obtained by plotting $w_1 = \sqrt{\lambda_1}\, u_1$ vs. $w_2 = \sqrt{\lambda_2}\, u_2$
(and, if necessary, vs. $w_3 = \sqrt{\lambda_3}\, u_3$).


↪ The quality of the factorial representation can be evaluated using $\tau_q$ which
is the percentage of inertia explained by the first $q$ factors.


8.6 Exercises


EXERCISE 8.1 Prove that $n^{-1} Z^{\top} Z$ is the covariance of the centered data matrix, where $Z$
is the matrix formed by the columns $z_k = \mathcal{X} u_k$.


EXERCISE 8.2 Compute the SVD of the French food data (Table B.6).

EXERCISE 8.3 Compute $\tau_3, \tau_4, \ldots$ for the French food data (Table B.6).

EXERCISE 8.4 Apply the factorial techniques to the Swiss bank notes (Table B.2).
EXERCISE 8.5 Apply the factorial techniques to the time budget data (Table B.14). 


EXERCISE 8.6 Assume that you wish to analyze p independent identically distributed ran- 
dom variables. What is the percentage of the inertia explained by the first factor? What is 
the percentage of the inertia explained by the first q factors? 


EXERCISE 8.7 Assume that you have $p$ i.i.d. r.v.'s. What does the eigenvector corresponding
to the first factor look like?


EXERCISE 8.8 Assume that you have two random variables, $X_1$ and $X_2 = 2X_1$. What do
the eigenvalues and eigenvectors of their correlation matrix look like? How many eigenvalues
are nonzero?


EXERCISE 8.9 What percentage of inertia is explained by the first factor in the previous 
exercise ? 


EXERCISE 8.10 How do the eigenvalues and eigenvectors in Example 8.1 change if we take 
the prices in $ instead of in EUR? Does it make a difference if some of the prices are in 
EUR and others in $? 


9 Principal Components Analysis 


Chapter 8 presented the basic geometric tools needed to produce a lower dimensional descrip- 
tion of the rows and columns of a multivariate data matrix. Principal components analysis 
has the same objective with the exception that the rows of the data matrix $\mathcal{X}$ will now be 
considered as observations from a $p$-variate random variable $X$. The principal idea of 
reducing the dimension of $X$ is achieved through linear combinations. Low dimensional linear 
combinations are often easier to interpret and serve as an intermediate step in a more com- 
plex data analysis. More precisely one looks for linear combinations which create the largest 
spread among the values of X. In other words, one is searching for linear combinations with 
the largest variances. 


Section 9.1 introduces the basic ideas and technical elements behind principal components. 
No particular assumption will be made on X except that the mean vector and the covariance 
matrix exist. When reference is made to a data matrix $\mathcal{X}$ in Section 9.2, the empirical 
mean and covariance matrix will be used. Section 9.3 shows how to interpret the principal 
components by studying their correlations with the original components of X. Often analyses 
are performed in practice by looking at two-dimensional scatterplots. Section 9.4 develops 
inference techniques on principal components. This is particularly helpful in establishing the 
appropriate dimension reduction and thus in determining the quality of the resulting lower 
dimensional representations. Since principal component analysis is performed on covariance 
matrices, it is not scale invariant. Often, the measurement units of the components of X are 
quite different, so it is reasonable to standardize the measurement units. The normalized 
version of principal components is defined in Section 9.5. In Section 9.6 it is discovered that 
the empirical principal components are the factors of appropriate transformations of the 
data matrix. The classical way of defining principal components through linear combinations 
with respect to the largest variance is described here in geometric terms, i.e., in terms of 
the optimal fit within subspaces generated by the columns and/or the rows of $\mathcal{X}$ as was 
discussed in Chapter 8. Section 9.9 concludes with additional examples. 




9.1 Standardized Linear Combinations 


The main objective of principal components analysis (PC) is to reduce the dimension of 
the observations. The simplest way of dimension reduction is to take just one element of 
the observed vector and to discard all others. This is not a very reasonable approach, as 
we have seen in the earlier chapters, since strength may be lost in interpreting the data. 
In the bank notes example we have seen that just one variable (e.g. $X_1$ = length) had no 
discriminatory power in distinguishing counterfeit from genuine bank notes. An alternative 
method is to weight all variables equally, i.e., to consider the simple average $p^{-1}\sum_{j=1}^p X_j$ of 
all the elements in the vector $X = (X_1,\dots,X_p)^\top$. This again is undesirable, since all of the 
elements of $X$ are considered with equal importance (weight). 


A more flexible approach is to study a weighted average, namely 


$$\delta^\top X = \sum_{j=1}^p \delta_j X_j \quad\text{so that}\quad \sum_{j=1}^p \delta_j^2 = 1. \tag{9.1}$$

The weighting vector $\delta = (\delta_1,\dots,\delta_p)^\top$ can then be optimized to investigate and to detect 
specific features. We call (9.1) a standardized linear combination (SLC). Which SLC should 
we choose? One aim is to maximize the variance of the projection $\delta^\top X$, i.e., to choose $\delta$ 
according to 


$$\max_{\{\delta:\|\delta\|=1\}} \operatorname{Var}(\delta^\top X) = \max_{\{\delta:\|\delta\|=1\}} \delta^\top \operatorname{Var}(X)\,\delta. \tag{9.2}$$


The interesting "directions" of $\delta$ are found through the spectral decomposition of the 
covariance matrix. Indeed, from Theorem 2.5, the direction $\delta$ is given by the eigenvector $\gamma_1$ 
corresponding to the largest eigenvalue $\lambda_1$ of the covariance matrix $\Sigma = \operatorname{Var}(X)$. 
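
The maximization in (9.2) can be illustrated numerically. The following is only a minimal sketch: the covariance matrix `Sigma` is a hypothetical example (it is not taken from any data set in the text), and the helper `slc_variance` is introduced here purely for illustration. It compares the variance attained by the leading eigenvector with the variances attained by randomly drawn unit-length weighting vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical covariance matrix, chosen only for illustration.
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.0, 0.2],
                  [0.3, 0.2, 0.5]])

# Spectral decomposition; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma)
lambda_1 = eigvals[-1]       # largest eigenvalue
gamma_1 = eigvecs[:, -1]     # corresponding eigenvector (unit length)


def slc_variance(delta, Sigma):
    """Variance of delta'X after rescaling delta to unit length."""
    delta = delta / np.linalg.norm(delta)
    return float(delta @ Sigma @ delta)


# Variances of many random SLCs, for comparison with the optimum.
random_variances = [slc_variance(rng.standard_normal(3), Sigma)
                    for _ in range(1000)]

print("variance along gamma_1:", slc_variance(gamma_1, Sigma))
print("largest eigenvalue    :", lambda_1)
print("best random SLC       :", max(random_variances))
```

Under these assumptions no random direction should exceed the largest eigenvalue, which is the content of the statement derived from Theorem 2.5.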


Figures 9.1 and 9.2 show two such projections (SLCs) of the same data set with zero mean. 
In Figure 9.1 an arbitrary projection is displayed. The upper window shows the data point 
cloud and the line onto which the data are projected. The middle window shows the projected 
values in the selected direction. The lower window shows the variance of the actual projection 
and the percentage of the total variance that is explained. 


Figure 9.2 shows the projection that captures the majority of the variance in the data. This 
direction is of interest and is located along the main direction of the point cloud. The same 
line of thought can be applied to all data orthogonal to this direction leading to the second 
eigenvector. The SLC with the highest variance obtained from maximizing (9.2) is the first 
principal component (PC) $Y_1 = \gamma_1^\top X$. Orthogonal to the direction $\gamma_1$ we find the SLC with 
the second highest variance: $Y_2 = \gamma_2^\top X$, the second PC. 


Proceeding in this way and writing in matrix notation, the result for a random variable $X$ 
with $E(X) = \mu$ and $\operatorname{Var}(X) = \Sigma = \Gamma\Lambda\Gamma^\top$ is the PC transformation which is defined as 


$$Y = \Gamma^\top (X - \mu). \tag{9.3}$$


Here we have centered the variable X in order to obtain a zero mean PC variable Y. 


Figure 9.1. An arbitrary SLC. Panel title: Direction in Data. Reported values: explained 
variance 0.52, total variance 1.79, explained percentage 0.29. MVApcasimu.xpl 


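EXAMPLE 9.1 Consider a bivariate normal distribution $N(0,\Sigma)$ with 
$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$ and $\rho > 0$ 
(see Example 3.13). Recall that the eigenvalues of this matrix are $\lambda_1 = 1+\rho$ and $\lambda_2 = 1-\rho$ 
with corresponding eigenvectors 

$$\gamma_1 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad \gamma_2 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix}.$$

The PC transformation is thus 

$$Y = \Gamma^\top (X - \mu) = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} X$$

or 

$$\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \frac{1}{\sqrt{2}}\begin{pmatrix} X_1 + X_2 \\ X_1 - X_2 \end{pmatrix}.$$

So the first principal component is 

$$Y_1 = \frac{1}{\sqrt{2}}(X_1 + X_2)$$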
Figure 9.2. The most interesting SLC. Panel title: Direction in Data. Reported values: 
explained variance 1.27, total variance 1.79, explained percentage 0.71. MVApcasimu.xpl 


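and the second is 

$$Y_2 = \frac{1}{\sqrt{2}}(X_1 - X_2).$$

Let us compute the variances of these PCs using formulas (4.22)-(4.26): 

$$\begin{aligned}
\operatorname{Var}(Y_1) &= \operatorname{Var}\left\{\frac{1}{\sqrt{2}}(X_1 + X_2)\right\} = \frac{1}{2}\operatorname{Var}(X_1 + X_2) \\
&= \frac{1}{2}\{\operatorname{Var}(X_1) + \operatorname{Var}(X_2) + 2\operatorname{Cov}(X_1, X_2)\} \\
&= \frac{1}{2}(1 + 1 + 2\rho) = 1 + \rho \\
&= \lambda_1.
\end{aligned}$$

Similarly we find that 
$$\operatorname{Var}(Y_2) = \lambda_2.$$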


This can be expressed more generally and is given in the next theorem. 
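THEOREM 9.1 For a given $X \sim (\mu,\Sigma)$ let $Y = \Gamma^\top(X - \mu)$ be the PC transformation. 
Then 

$$\begin{aligned}
E Y_j &= 0, \quad j = 1,\dots,p & (9.4)\\
\operatorname{Var}(Y_j) &= \lambda_j, \quad j = 1,\dots,p & (9.5)\\
\operatorname{Cov}(Y_i, Y_j) &= 0, \quad i \ne j & (9.6)\\
\operatorname{Var}(Y_1) \ge \operatorname{Var}(Y_2) &\ge \dots \ge \operatorname{Var}(Y_p) \ge 0 & (9.7)\\
\sum_{j=1}^p \operatorname{Var}(Y_j) &= \operatorname{tr}(\Sigma) & (9.8)\\
\prod_{j=1}^p \operatorname{Var}(Y_j) &= |\Sigma|. & (9.9)
\end{aligned}$$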


The connection between the PC transformation and the search for the best SLC is made in 
the following theorem, which follows directly from (9.2) and Theorem 2.5. 


THEOREM 9.2 There exists no SLC that has larger variance than $\lambda_1 = \operatorname{Var}(Y_1)$. 


THEOREM 9.3 If $Y = a^\top X$ is a SLC that is not correlated with the first $k$ PCs of $X$, then 
the variance of $Y$ is maximized by choosing it to be the $(k + 1)$-st PC. 
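
The statements (9.5) to (9.9) of Theorem 9.1 can be checked numerically for any given covariance matrix. The sketch below assumes a hypothetical $3 \times 3$ matrix `Sigma` (an illustrative choice, not from the text) and uses the fact that the covariance of $Y = \Gamma^\top(X-\mu)$ is $\Gamma^\top \Sigma \Gamma$.

```python
import numpy as np

# Hypothetical covariance matrix, chosen only for illustration.
Sigma = np.array([[4.0, 1.2, 0.6],
                  [1.2, 2.0, 0.4],
                  [0.6, 0.4, 1.0]])

# Spectral decomposition Sigma = Gamma Lambda Gamma', eigenvalues sorted decreasingly.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]
Gamma = eigvecs[:, order]

# Covariance matrix of the PC vector Y = Gamma'(X - mu).
cov_Y = Gamma.T @ Sigma @ Gamma

print(np.round(cov_Y, 10))               # diagonal matrix with lam on the diagonal
print(lam.sum(), np.trace(Sigma))        # (9.8): sum of variances equals tr(Sigma)
print(lam.prod(), np.linalg.det(Sigma))  # (9.9): product of variances equals |Sigma|
```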


Summary 


- A standardized linear combination (SLC) is a weighted average $\delta^\top X = 
\sum_{j=1}^p \delta_j X_j$ where $\delta$ is a vector of length 1. 


- Maximizing the variance of $\delta^\top X$ leads to the choice $\delta = \gamma_1$, the eigenvector 
corresponding to the largest eigenvalue $\lambda_1$ of $\Sigma = \operatorname{Var}(X)$. 
This is a projection of $X$ into the one-dimensional space, where the 
components of $X$ are weighted by the elements of $\gamma_1$. $Y_1 = \gamma_1^\top (X - \mu)$ 
is called the first principal component (PC). 


- This projection can be generalized for higher dimensions. The PC 
transformation is the linear transformation $Y = \Gamma^\top(X - \mu)$, where $\Sigma = 
\operatorname{Var}(X) = \Gamma\Lambda\Gamma^\top$ and $\mu = EX$. 
$Y_1, Y_2, \dots, Y_p$ are called the first, second, ..., and $p$-th PCs. 


- The PCs have zero means, variance $\operatorname{Var}(Y_j) = \lambda_j$, and zero covariances. 
From $\lambda_1 \ge \dots \ge \lambda_p$ it follows that $\operatorname{Var}(Y_1) \ge \dots \ge \operatorname{Var}(Y_p)$. It holds 
that $\sum_{j=1}^p \operatorname{Var}(Y_j) = \operatorname{tr}(\Sigma)$ and $\prod_{j=1}^p \operatorname{Var}(Y_j) = |\Sigma|$. 


- If $Y = a^\top X$ is a SLC which is not correlated with the first $k$ PCs of $X$ 
then the variance of $Y$ is maximized by choosing it to be the $(k + 1)$-st 
PC. 


9.2 Principal Components in Practice 


In practice the PC transformation has to be replaced by the respective estimators: $\mu$ becomes 
$\bar{x}$, $\Sigma$ is replaced by $S$, etc. If $g_1$ denotes the first eigenvector of $S$, the first principal 
component is given by $y_1 = (\mathcal{X} - 1_n\bar{x}^\top)g_1$. More generally if $S = GLG^\top$ is the spectral 
decomposition of $S$, then the PCs are obtained by 


$$\mathcal{Y} = (\mathcal{X} - 1_n\bar{x}^\top)G. \tag{9.10}$$


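Note that with the centering matrix $H = I - (n^{-1}1_n1_n^\top)$ and $H1_n\bar{x}^\top = 0$ we can write 


$$\begin{aligned}
S_{\mathcal{Y}} &= n^{-1}\mathcal{Y}^\top H \mathcal{Y} = n^{-1} G^\top (\mathcal{X} - 1_n\bar{x}^\top)^\top H (\mathcal{X} - 1_n\bar{x}^\top) G \\
&= n^{-1} G^\top \mathcal{X}^\top H \mathcal{X} G = G^\top S G = L
\end{aligned} \tag{9.11}$$

where $L = \operatorname{diag}(\ell_1,\dots,\ell_p)$ is the matrix of eigenvalues of $S$. Hence the variance of $y_i$ equals 
the eigenvalue $\ell_i$! 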
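
The empirical PC transformation (9.10) and the identity (9.11) can be sketched as follows. The data matrix here is simulated (it is a stand-in, not the bank data), and the covariance uses the divisor $n$, matching the convention $S = n^{-1}\mathcal{X}^\top H \mathcal{X}$ used above; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3

# Simulated data matrix standing in for X (hypothetical, not the bank data).
X = rng.standard_normal((n, p)) @ np.array([[2.0, 0.0, 0.0],
                                            [0.5, 1.0, 0.0],
                                            [0.2, 0.3, 0.5]])

x_bar = X.mean(axis=0)
X_centered = X - x_bar                 # X - 1_n x_bar'
S = X_centered.T @ X_centered / n      # empirical covariance with divisor n

# Spectral decomposition S = G L G', eigenvalues sorted in decreasing order.
ell, G = np.linalg.eigh(S)
order = np.argsort(ell)[::-1]
ell, G = ell[order], G[:, order]

Y = X_centered @ G                     # PC transformation (9.10)
S_Y = Y.T @ Y / n                      # covariance of the PCs (columns of Y have mean zero)

print(np.round(S_Y, 10))               # should be diag(ell_1, ..., ell_p), as in (9.11)
print(ell)
```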


The PC technique is sensitive to scale changes. If we multiply one variable by a scalar we 
obtain different eigenvalues and eigenvectors. This is due to the fact that an eigenvalue 
decomposition is performed on the covariance matrix and not on the correlation matrix 
(see Section 9.5). The following warning is therefore important: 


The PC transformation should be applied to data that have approximately the same 
scale in each variable. 


EXAMPLE 9.2 Let us apply this technique to the bank data set. In this example we do not 
standardize the data. Figure 9.3 shows some PC plots of the bank data set. The genuine and 
counterfeit bank notes are marked by “o” and “+” respectively. 


Recall that the mean vector of $\mathcal{X}$ is 
$$\bar{x} = (214.9, 130.1, 129.9, 9.4, 10.6, 140.5)^\top.$$
The vector of eigenvalues of $S$ is 
$$\ell = (2.985, 0.931, 0.242, 0.194, 0.085, 0.035)^\top.$$
Figure 9.3. Principal components of the bank data. Panels include: first vs. second PC, 
second vs. third PC, and eigenvalues of S (lambda against index). MVApcabank.xpl 


The eigenvectors $g_j$ are given by the columns of the matrix 

$$G = \begin{pmatrix}
-0.044 & 0.011 & 0.326 & 0.562 & -0.753 & 0.098 \\
0.112 & 0.071 & 0.259 & 0.455 & 0.347 & -0.767 \\
0.139 & 0.066 & 0.345 & 0.415 & 0.535 & 0.632 \\
0.768 & -0.563 & 0.218 & -0.186 & -0.100 & -0.022 \\
0.202 & 0.659 & 0.557 & -0.451 & -0.102 & -0.035 \\
-0.579 & -0.489 & 0.592 & -0.258 & 0.085 & -0.046
\end{pmatrix}.$$

The first column of $G$ is the first eigenvector and gives the weights used in the linear 
combination of the original data in the first PC. 


EXAMPLE 9.3 To see how sensitive the PCs are to a change in the scale of the variables, 
assume that $X_1$, $X_2$, $X_3$ and $X_6$ are measured in cm and that $X_4$ and $X_5$ remain in mm in 
the bank data set. This leads to: 


$$\bar{x} = (21.49, 13.01, 12.99, 9.41, 10.65, 14.05)^\top.$$
Figure 9.4. Principal components of the rescaled bank data. Panels include: first vs. 
second PC, second vs. third PC. MVApcabankr.xpl 


The covariance matrix can be obtained from $S$ in (3.4) by dividing rows 1, 2, 3, 6 and 
columns 1, 2, 3, 6 by 10. We obtain: 

$$\ell = (2.101, 0.623, 0.005, 0.002, 0.001, 0.0004)^\top$$

which clearly differs from Example 9.2. Only the first two eigenvectors are given: 

$$g_1 = (-0.005, 0.011, 0.014, 0.992, 0.113, -0.052)^\top$$
$$g_2 = (-0.001, 0.013, 0.016, -0.117, 0.991, -0.069)^\top.$$

Comparing these results to the first two columns of $G$ from Example 9.2, a completely different 
story is revealed. Here the first component is dominated by $X_4$ (lower margin) and the second 
by $X_5$ (upper margin), while all of the other variables have much less weight. The results are 
shown in Figure 9.4. Section 9.5 will show how to select a reasonable standardization of the 
variables when the scales are too different. 
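
The scale sensitivity described in Example 9.3 can be reproduced on simulated data. The sketch below is an illustration under assumptions: the data are random (not the bank notes), the rescaling factor 10 is applied to one hypothetical variable, and `leading_eigenvector` is a helper defined here only for this purpose. It also shows that the correlation matrix, used for the normalized version, is unaffected by the rescaling.

```python
import numpy as np


def leading_eigenvector(M):
    """Eigenvector of the largest eigenvalue of a symmetric matrix, with a fixed sign."""
    vals, vecs = np.linalg.eigh(M)       # ascending eigenvalues
    v = vecs[:, -1]
    return v if v[np.argmax(np.abs(v))] > 0 else -v


rng = np.random.default_rng(2)

# Simulated correlated data (hypothetical stand-in for a real data matrix).
X = rng.standard_normal((500, 3)) @ np.array([[1.0, 0.6, 0.3],
                                              [0.0, 1.0, 0.4],
                                              [0.0, 0.0, 1.0]])

# Express the third variable in a unit ten times smaller (values ten times larger).
X_rescaled = X * np.array([1.0, 1.0, 10.0])

g1 = leading_eigenvector(np.cov(X, rowvar=False))
g1_rescaled = leading_eigenvector(np.cov(X_rescaled, rowvar=False))

r1 = leading_eigenvector(np.corrcoef(X, rowvar=False))
r1_rescaled = leading_eigenvector(np.corrcoef(X_rescaled, rowvar=False))

print("covariance PCA, original :", np.round(g1, 3))
print("covariance PCA, rescaled :", np.round(g1_rescaled, 3))   # dominated by variable 3
print("correlation PCA, original:", np.round(r1, 3))
print("correlation PCA, rescaled:", np.round(r1_rescaled, 3))   # unchanged
```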
Summary 


- The scale of the variables should be roughly the same for PC 
transformations. 


- For the practical implementation of principal components analysis (PCA) 
we replace $\mu$ by the mean $\bar{x}$ and $\Sigma$ by the empirical covariance $S$. Then 
we compute the eigenvalues $\ell_1,\dots,\ell_p$ and the eigenvectors $g_1,\dots,g_p$ of $S$. 
The graphical representation of the PCs is obtained by plotting the first 
PC vs. the second (and eventually vs. the third). 


- The components of the eigenvectors $g_i$ are the weights of the original 
variables in the PCs. 


9.3 Interpretation of the PCs 


Recall that the main idea of PC transformations is to find the most informative projections 
that maximize variances. The most informative SLC is given by the first eigenvector. In 
Section 9.2 the eigenvectors were calculated for the bank data. In particular, with centered 
$x$'s, we had: 

$$y_1 = -0.044x_1 + 0.112x_2 + 0.139x_3 + 0.768x_4 + 0.202x_5 - 0.579x_6$$
$$y_2 = 0.011x_1 + 0.071x_2 + 0.066x_3 - 0.563x_4 + 0.659x_5 - 0.489x_6$$

and 

$x_1$ = length 
$x_2$ = left height 
$x_3$ = right height 
$x_4$ = bottom frame 
$x_5$ = top frame 
$x_6$ = diagonal. 


Hence, the first PC is essentially the difference between the bottom frame variable and the 
diagonal. The second PC is best described by the difference between the top frame variable 
and the sum of bottom frame and diagonal variables. 


The weighting of the PCs tells us in which directions, expressed in original coordinates, 
the best variance explanation is obtained. A measure of how well the first $q$ PCs explain 
variation is given by the relative proportion: 

$$\psi_q = \frac{\sum_{j=1}^q \lambda_j}{\sum_{j=1}^p \lambda_j} = \frac{\sum_{j=1}^q \operatorname{Var}(Y_j)}{\sum_{j=1}^p \operatorname{Var}(Y_j)}. \tag{9.12}$$

eigenvalue    proportion of variance    cumulated proportion 
2.985         0.67                      0.67 
0.931         0.21                      0.88 
0.242         0.05                      0.93 
0.194         0.04                      0.97 
0.085         0.02                      0.99 
0.035         0.01                      1.00 

Table 9.3. Proportion of variance of PCs 

Referring to the bank data example 9.2, the (cumulative) proportions of explained variance 
are given in Table 9.3. The first PC (q = 1) already explains 67% of the variation. The first 
three (q = 3) PCs explain 93% of the variation. Once again it should be noted that PCs are 
not scale invariant, e.g., the PCs derived from the correlation matrix give different results 
than the PCs derived from the covariance matrix (see Section 9.5). 
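
The proportions in Table 9.3 follow directly from (9.12) once the eigenvalues are known. As a small worked sketch (the variable names are illustrative), the eigenvalues of $S$ quoted in Example 9.2 give:

```python
# Eigenvalues of S for the bank data, as quoted in Example 9.2.
eigenvalues = [2.985, 0.931, 0.242, 0.194, 0.085, 0.035]

total = sum(eigenvalues)
proportions = [ell / total for ell in eigenvalues]

# Cumulated proportions: psi_q for q = 1, ..., p as in (9.12).
cumulative = []
running = 0.0
for prop in proportions:
    running += prop
    cumulative.append(running)

for ell, prop, cum in zip(eigenvalues, proportions, cumulative):
    print(f"{ell:6.3f}  {prop:5.2f}  {cum:5.2f}")
```

Rounded to two decimals, the printed columns agree with the entries of Table 9.3.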


A good graphical representation of the ability of the PCs to explain the variation in the data 
is given by the scree plot shown in the lower righthand window of Figure 9.3. The scree plot 
can be modified by using the relative proportions on the y-axis, as is shown in Figure 9.5 
for the bank data set. 


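The covariance between the PC vector $Y$ and the original vector $X$ is calculated with the 
help of (9.4) as follows: 

$$\begin{aligned}
\operatorname{Cov}(X, Y) &= E(XY^\top) - EX\,EY^\top = E(XY^\top) \\
&= E(XX^\top\Gamma) - \mu\mu^\top\Gamma = \operatorname{Var}(X)\Gamma \\
&= \Sigma\Gamma \\
&= \Gamma\Lambda\Gamma^\top\Gamma \\
&= \Gamma\Lambda.
\end{aligned} \tag{9.13}$$

Hence, the correlation, $\rho_{X_iY_j}$, between variable $X_i$ and the PC $Y_j$ is 

$$\rho_{X_iY_j} = \frac{\gamma_{ij}\lambda_j}{(\sigma_{X_iX_i}\lambda_j)^{1/2}} = \gamma_{ij}\left(\frac{\lambda_j}{\sigma_{X_iX_i}}\right)^{1/2}. \tag{9.14}$$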
Figure 9.5. Relative proportion of variance explained by PCs (plot title: Swiss bank 
notes; vertical axis: variance explained). MVApcabanki.xpl 


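Using actual data, this of course translates into 

$$r_{X_iY_j} = g_{ij}\left(\frac{\ell_j}{s_{X_iX_i}}\right)^{1/2}. \tag{9.15}$$

The correlations can be used to evaluate the relations between the PCs $Y_j$ where $j = 1,\dots,q$, 
and the original variables $X_i$ where $i = 1,\dots,p$. Note that 

$$\sum_{j=1}^p r^2_{X_iY_j} = \frac{\sum_{j=1}^p \ell_j g_{ij}^2}{s_{X_iX_i}} = \frac{s_{X_iX_i}}{s_{X_iX_i}} = 1. \tag{9.16}$$

Indeed, $\sum_{j=1}^p \ell_j g_{ij}^2 = g_i^\top L g_i$ is the $(i,i)$-element of the matrix $GLG^\top = S$, so that $r^2_{X_iY_j}$ may 
be seen as the proportion of variance of $X_i$ explained by $Y_j$. 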
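
Formulas (9.15) and (9.16) can be verified on any data matrix. The following sketch uses simulated data (a hypothetical stand-in, not the bank notes) and compares the correlations computed from the eigen-decomposition with the empirical correlations between the columns of the data and the PC scores.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 300, 4

# Simulated data with a random linear mixing (illustrative only).
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n

ell, G = np.linalg.eigh(S)
order = np.argsort(ell)[::-1]
ell, G = ell[order], G[:, order]

# (9.15): r_{X_i Y_j} = g_ij * sqrt(ell_j / s_{X_i X_i})
R_XY = G * np.sqrt(ell)[None, :] / np.sqrt(np.diag(S))[:, None]

# Direct empirical correlations between the variables and the PC scores.
Y = Xc @ G
direct = np.corrcoef(np.hstack([Xc, Y]), rowvar=False)[:p, p:]

print(np.round(R_XY, 3))
print(np.round((R_XY ** 2).sum(axis=1), 10))   # (9.16): each row sums to one
```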

In the space of the first two PCs we plot these proportions, i.e., $r_{X_iY_2}$ versus $r_{X_iY_1}$. Figure 9.6 
shows this for the bank notes example. This plot shows which of the original variables are 
most strongly correlated with PC $Y_1$ and $Y_2$. 
Figure 9.6. The correlation of the original variable with the PCs (plot title: Swiss bank 
notes; horizontal axis: first PC; vertical axis: second PC). MVApcabanki.xpl 


From (9.16) it obviously follows that $r^2_{X_iY_1} + r^2_{X_iY_2} \le 1$ so that the points are always inside 
the circle of radius 1. In the bank notes example, the variables $X_4$, $X_5$ and $X_6$ correspond to 
correlations near the periphery of the circle and are thus well explained by the first two PCs. 
Recall that we have interpreted the first PC as being essentially the difference between $X_4$ 
and $X_6$. This is also reflected in Figure 9.6 since the points corresponding to these variables 
lie on different sides of the vertical axis. An analogous remark applies to the second PC. 
We had seen that the second PC is well described by the difference between $X_5$ and the 
sum of $X_4$ and $X_6$. Now we are able to see this result again from Figure 9.6 since the point 
corresponding to $X_5$ lies above the horizontal axis and the points corresponding to $X_4$ and 
$X_6$ lie below. 


The correlations of the original variables $X_i$ and the first two PCs are given in Table 9.4 
along with the cumulated percentage of variance of each variable explained by $Y_1$ and $Y_2$. 
This table confirms the above results. In particular, it confirms that the percentage of 
variance of $X_1$ (and $X_2$, $X_3$) explained by the first two PCs is relatively small and so are 
their weights in the graphical representation of the individual bank notes in the space of the 
first two PCs (as can be seen in the upper left plot in Figure 9.3). Looking simultaneously 
at Figure 9.6 and the upper left plot of Figure 9.3 shows that the genuine bank notes are 
roughly characterized by large values of $X_6$ and smaller values of $X_4$. The counterfeit bank 
notes show larger values of $X_5$ (see Example 7.15). 

                  $r_{X_iY_1}$   $r_{X_iY_2}$   $r^2_{X_iY_1} + r^2_{X_iY_2}$ 
$X_1$ length        -0.201       0.028       0.041 
$X_2$ left h.        0.538       0.191       0.326 
$X_3$ right h.       0.597       0.159       0.381 
$X_4$ lower          0.921      -0.377       0.991 
$X_5$ upper          0.435       0.794       0.820 
$X_6$ diagonal      -0.870      -0.410       0.926 

Table 9.4. Correlation between the original variables and the PCs 


Summary 


- The weighting of the PCs tells us in which directions, expressed in original 
coordinates, the best explanation of the variance is obtained. Note that 
the PCs are not scale invariant. 


- A measure of how well the first $q$ PCs explain variation is given by the 
relative proportion $\psi_q = \sum_{j=1}^q \lambda_j / \sum_{j=1}^p \lambda_j$. A good graphical 
representation of the ability of the PCs to explain the variation in the data is the 
scree plot of these proportions. 


- The correlation between PC $Y_j$ and an original variable $X_i$ is $\rho_{X_iY_j} = 
\gamma_{ij}\left(\lambda_j / \sigma_{X_iX_i}\right)^{1/2}$. For a data matrix this translates into $r^2_{X_iY_j} = 
\ell_j g_{ij}^2 / s_{X_iX_i}$. $r^2_{X_iY_j}$ 
can be interpreted as the proportion of variance of $X_i$ explained by $Y_j$. 
A plot of $r_{X_iY_1}$ vs. $r_{X_iY_2}$ shows which of the original variables are most 
strongly correlated with the PCs, namely those that are close to the 
periphery of the circle of radius 1. 




9.4 Asymptotic Properties of the PCs 


In practice, PCs are computed from sample data. The following theorem yields results on 
the asymptotic distribution of the sample PCs. 


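THEOREM 9.4 Let $\Sigma > 0$ with distinct eigenvalues, and let $U \sim m^{-1}W_p(\Sigma, m)$ with spectral 
decompositions $\Sigma = \Gamma\Lambda\Gamma^\top$ and $U = GLG^\top$. Then 


(a) $\sqrt{m}(\ell - \lambda) \xrightarrow{\mathcal{L}} N_p(0, 2\Lambda^2)$, 
where $\ell = (\ell_1,\dots,\ell_p)^\top$ and $\lambda = (\lambda_1,\dots,\lambda_p)^\top$ are the diagonals of $L$ and $\Lambda$, 


(b) $\sqrt{m}(g_j - \gamma_j) \xrightarrow{\mathcal{L}} N_p(0, V_j)$, 
with $V_j = \lambda_j \sum_{k \ne j} \dfrac{\lambda_k}{(\lambda_k - \lambda_j)^2}\,\gamma_k\gamma_k^\top$, 


(c) $\operatorname{Cov}(g_j, g_k) = V_{jk}$, 
where the $(r,s)$-element of the matrix $V_{jk}$ $(p \times p)$ is $-\dfrac{\lambda_j\lambda_k\gamma_{rk}\gamma_{sj}}{[m(\lambda_j - \lambda_k)^2]}$, 


(d) the elements in $\ell$ are asymptotically independent of the elements in $G$. 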


EXAMPLE 9.4 Since $nS \sim W_p(\Sigma, n-1)$ if $X_1,\dots,X_n$ are drawn from $N(\mu,\Sigma)$, we have 
that 
$$\sqrt{n-1}\,(\ell_j - \lambda_j) \xrightarrow{\mathcal{L}} N(0, 2\lambda_j^2), \quad j = 1,\dots,p. \tag{9.17}$$
Since the variance of (9.17) depends on the true value $\lambda_j$ a log transformation is useful. 
Consider $f(\ell_j) = \log(\ell_j)$. Then $\left.\frac{d}{d\ell_j} f\right|_{\ell_j = \lambda_j} = \frac{1}{\lambda_j}$ and by the Transformation Theorem 4.11 we 
have from (9.17) that 
$$\sqrt{n-1}\,(\log \ell_j - \log \lambda_j) \xrightarrow{\mathcal{L}} N(0, 2). \tag{9.18}$$

Hence, 
$$\sqrt{\frac{n-1}{2}}\,(\log \ell_j - \log \lambda_j) \xrightarrow{\mathcal{L}} N(0, 1)$$

and a two-sided confidence interval at the $1 - \alpha = 0.95$ confidence level is given by 

$$\log(\ell_j) - 1.96\sqrt{\frac{2}{n-1}} \le \log \lambda_j \le \log(\ell_j) + 1.96\sqrt{\frac{2}{n-1}}.$$

In the bank data example we have that 
$$\ell_1 = 2.98.$$
Therefore, 
$$\log(2.98) \pm 1.96\sqrt{\frac{2}{199}} = \log(2.98) \pm 0.1965.$$
It can be concluded for the true eigenvalue that 

$$P\{\lambda_1 \in (2.448, 3.62)\} \approx 0.95.$$
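
The arithmetic of this interval can be retraced in a few lines. The sketch below assumes $n = 200$ observations (so that $n - 1 = 199$) and uses the rounded eigenvalue $\ell_1 = 2.98$ from the example; the variable names are illustrative.

```python
import math

n = 200          # number of bank notes, assumed here so that n - 1 = 199
ell_1 = 2.98     # largest eigenvalue of S, as used in Example 9.4
z = 1.96         # standard normal quantile for a 95% two-sided interval

# Half-width of the interval on the log scale, from (9.18).
half_width = z * math.sqrt(2.0 / (n - 1))

# Back-transform the limits for log(lambda_1) to limits for lambda_1.
lower = math.exp(math.log(ell_1) - half_width)
upper = math.exp(math.log(ell_1) + half_width)

print(f"half-width on log scale: {half_width:.4f}")
print(f"interval for lambda_1  : ({lower:.3f}, {upper:.3f})")
```

The lower limit reproduces 2.448; the upper limit evaluates to about 3.627, which the example reports as 3.62.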


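Variance explained by the first $q$ PCs. 


The variance explained by the first $q$ PCs is given by 

$$\psi = \frac{\lambda_1 + \dots + \lambda_q}{\sum_{j=1}^p \lambda_j}.$$

In practice this is estimated by 

$$\hat{\psi} = \frac{\ell_1 + \dots + \ell_q}{\sum_{j=1}^p \ell_j}.$$

From Theorem 9.4 we know the distribution of $\sqrt{n-1}\,(\ell - \lambda)$. Since $\psi$ is a nonlinear function 
of $\lambda$, we can again apply the Transformation Theorem 4.11 to obtain that 

$$\sqrt{n-1}\,(\hat{\psi} - \psi) \xrightarrow{\mathcal{L}} N(0, \mathcal{D}^\top \mathcal{V} \mathcal{D})$$

where $\mathcal{V} = 2\Lambda^2$ (from Theorem 9.4) and $\mathcal{D} = (d_1,\dots,d_p)^\top$ with 

$$d_j = \frac{\partial \psi}{\partial \lambda_j} = \begin{cases} \dfrac{1-\psi}{\operatorname{tr}(\Sigma)} & \text{for } 1 \le j \le q, \\[2ex] \dfrac{-\psi}{\operatorname{tr}(\Sigma)} & \text{for } q+1 \le j \le p. \end{cases}$$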
Given this result, the following theorem can be derived. 


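THEOREM 9.5 
$$\sqrt{n-1}\,(\hat{\psi} - \psi) \xrightarrow{\mathcal{L}} N(0, \omega^2),$$

where 

$$\begin{aligned}
\omega^2 = \mathcal{D}^\top \mathcal{V} \mathcal{D} &= \frac{2}{\{\operatorname{tr}(\Sigma)\}^2}\left\{(1-\psi)^2(\lambda_1^2 + \dots + \lambda_q^2) + \psi^2(\lambda_{q+1}^2 + \dots + \lambda_p^2)\right\} \\
&= \frac{2\operatorname{tr}(\Sigma^2)}{\{\operatorname{tr}(\Sigma)\}^2}\,(\psi^2 - 2\beta\psi + \beta)
\end{aligned}$$

and 

$$\beta = \frac{\lambda_1^2 + \dots + \lambda_q^2}{\lambda_1^2 + \dots + \lambda_p^2}.$$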
EXAMPLE 9.5 From Section 9.3 it is known that the first PC for the Swiss bank notes 
resolves 67% of the variation. It can be tested whether the true proportion is actually 75%. 
Computing 

$$\begin{aligned}
\hat{\beta} &= \frac{\ell_1^2}{\ell_1^2 + \dots + \ell_p^2} = \frac{(2.985)^2}{(2.985)^2 + (0.931)^2 + \dots + (0.035)^2} = 0.902 \\
\operatorname{tr}(S) &= 4.472 \\
\operatorname{tr}(S^2) &= \sum_{j=1}^p \ell_j^2 = 9.883 \\
\hat{\omega}^2 &= \frac{2\operatorname{tr}(S^2)}{\{\operatorname{tr}(S)\}^2}\,(\hat{\psi}^2 - 2\hat{\beta}\hat{\psi} + \hat{\beta}) \\
&= \frac{2 \cdot 9.883}{(4.472)^2}\left\{(0.668)^2 - 2(0.902)(0.668) + 0.902\right\} = 0.142.
\end{aligned}$$

Hence, a confidence interval at the confidence level $1 - \alpha = 0.95$ is given by 

$$0.668 \pm 1.96\sqrt{\frac{0.142}{199}} = (0.615, 0.720).$$

Clearly the hypothesis that $\psi = 75\%$ can be rejected! 
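
The computation in this example can be retraced from the rounded eigenvalues quoted in Example 9.2. This is only a sketch: $n = 200$ is assumed (so that $n - 1 = 199$), and because the inputs are rounded to three decimals, intermediate values such as $\operatorname{tr}(S^2)$ differ slightly in the last digits from those printed above.

```python
import math

# Rounded eigenvalues of S for the bank data (Example 9.2).
eigenvalues = [2.985, 0.931, 0.242, 0.194, 0.085, 0.035]
n = 200   # assumed sample size, so that n - 1 = 199
q = 1     # number of PCs whose explained proportion is examined

tr_S = sum(eigenvalues)
tr_S2 = sum(ell ** 2 for ell in eigenvalues)

psi_hat = sum(eigenvalues[:q]) / tr_S
beta_hat = sum(ell ** 2 for ell in eigenvalues[:q]) / tr_S2

# Estimated asymptotic variance from Theorem 9.5.
omega2_hat = 2 * tr_S2 / tr_S ** 2 * (psi_hat ** 2 - 2 * beta_hat * psi_hat + beta_hat)

half_width = 1.96 * math.sqrt(omega2_hat / (n - 1))
lower, upper = psi_hat - half_width, psi_hat + half_width

print(f"psi_hat = {psi_hat:.3f}, beta_hat = {beta_hat:.3f}, omega2_hat = {omega2_hat:.3f}")
print(f"95% interval: ({lower:.3f}, {upper:.3f})")
print("0.75 inside interval:", lower <= 0.75 <= upper)
```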


Summary 


- The eigenvalues $\ell_j$ and eigenvectors $g_j$ are asymptotically normally 
distributed, in particular $\sqrt{n-1}\,(\ell - \lambda) \xrightarrow{\mathcal{L}} N_p(0, 2\Lambda^2)$. 


- For the eigenvalues it holds that $\sqrt{\frac{n-1}{2}}\,(\log \ell_j - \log \lambda_j) \xrightarrow{\mathcal{L}} N(0, 1)$. 


- Given an asymptotic normal distribution, approximate confidence intervals 
and tests can be constructed for the proportion of variance which is 
explained by the first $q$ PCs. The two-sided confidence interval at the $1 - \alpha = 
0.95$ level is given by $\log(\ell_j) - 1.96\sqrt{\frac{2}{n-1}} \le \log \lambda_j \le \log(\ell_j) + 1.96\sqrt{\frac{2}{n-1}}$. 


- It holds for $\hat{\psi}$, the estimate of $\psi$ (the proportion of the variance explained 
by the first $q$ PCs) that $\sqrt{n-1}\,(\hat{\psi} - \psi) \xrightarrow{\mathcal{L}} N(0, \omega^2)$, where $\omega$ is given in 
Theorem 9.5. 
9.5 Normalized Principal Components Analysis 


In certain situations the original variables can be heterogeneous w.r.t. their variances. This 
is particularly true when the variables are measured on heterogeneous scales (such as years, 
kilograms, dollars, ...). In this case a description of the information contained in the data 
needs to be provided which is robust w.r.t. the choice of scale. This can be achieved through 
a standardization of the variables, namely 


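$$\mathcal{X}_S = H\mathcal{X}D^{-1/2} \tag{9.19}$$


where $D = \operatorname{diag}(s_{X_1X_1},\dots,s_{X_pX_p})$. Note that $\bar{x}_S = 0$ and $S_{\mathcal{X}_S} = R$, the correlation matrix 
of $\mathcal{X}$. The PC transformations of the matrix $\mathcal{X}_S$ are referred to as the Normalized Principal 
Components (NPCs). The spectral decomposition of $R$ is 


$$R = G_R L_R G_R^\top, \tag{9.20}$$


where $L_R = \operatorname{diag}(\ell_1^R,\dots,\ell_p^R)$ and $\ell_1^R \ge \dots \ge \ell_p^R$ are the eigenvalues of $R$ with corresponding 
eigenvectors $g_1^R,\dots,g_p^R$ (note that here $\sum_{j=1}^p \ell_j^R = \operatorname{tr}(R) = p$). 


The NPCs, $Z_j$, provide a representation of each individual, and are given by 


$$\mathcal{Z} = \mathcal{X}_S G_R = (z_1,\dots,z_p). \tag{9.21}$$

After transforming the variables, once again, we have that 
$$\bar{z} = 0, \tag{9.22}$$
$$S_Z = G_R^\top S_{\mathcal{X}_S} G_R = G_R^\top R\, G_R = L_R. \tag{9.23}$$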


The NPCs provide a perspective similar to that of the PCs, but in terms of the relative 
position of individuals, NPC gives each variable the same weight (with the PCs the variable 
with the largest variance receives the largest weight). 


Computing the covariance and correlation between $X_i$ and $Z_j$ is straightforward: 

$$S_{\mathcal{X}_S, Z} = \frac{1}{n}\mathcal{X}_S^\top \mathcal{Z} = G_R L_R, \tag{9.24}$$
$$R_{\mathcal{X}_S, Z} = G_R L_R L_R^{-1/2} = G_R L_R^{1/2}. \tag{9.25}$$

The correlations between the original variables $X_i$ and the NPCs $Z_j$ are: 

$$r_{X_iZ_j} = r_{X_{S i}Z_j} = \sqrt{\ell_j}\; g_{R,ij} \tag{9.26}$$
$$\sum_{j=1}^p r^2_{X_iZ_j} = 1 \tag{9.27}$$

(compare this to (9.15) and (9.16)). The resulting NPCs, the $Z_j$, can be interpreted in terms 
of the original variables and the role of each PC in explaining the variation in variable $X_i$ 
can be evaluated. 
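
The construction of the NPCs in (9.19) to (9.27) can be sketched on simulated data. The variables below are deliberately given very different scales (the scale factors and the mixing matrix are hypothetical choices for illustration), and the covariance uses the divisor $n$ as in the text.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 250, 3

# Simulated correlated variables measured on very different scales (illustrative only).
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.0, 1.0, 0.4],
                   [0.0, 0.0, 1.0]])
scales = np.array([1.0, 50.0, 0.01])
X = (rng.standard_normal((n, p)) @ mixing) * scales

# Standardization (9.19): X_S = H X D^{-1/2}.
Xc = X - X.mean(axis=0)
s_diag = (Xc ** 2).mean(axis=0)          # variances s_{X_j X_j} with divisor n
X_S = Xc / np.sqrt(s_diag)

# Correlation matrix and its spectral decomposition (9.20).
R = X_S.T @ X_S / n
ell_R, G_R = np.linalg.eigh(R)
order = np.argsort(ell_R)[::-1]
ell_R, G_R = ell_R[order], G_R[:, order]

Z = X_S @ G_R                            # NPCs (9.21)
S_Z = Z.T @ Z / n                        # should equal L_R, see (9.23)
R_XZ = G_R * np.sqrt(ell_R)              # correlations (9.25)/(9.26)

print(np.round(ell_R, 3), ell_R.sum())   # eigenvalues sum to p
print(np.round((R_XZ ** 2).sum(axis=1), 10))   # (9.27): rows sum to one
```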
9.6 Principal Components as a Factorial Method 


The empirical PCs (normalized or not) turn out to be equivalent to the factors that one 
would obtain by decomposing the appropriate data matrix into its factors (see Chapter 8). 
It will be shown that the PCs are the factors representing the rows of the centered data 
matrix and that the NPCs correspond to the factors of the standardized data matrix. The 
representation of the columns of the standardized data matrix provides (at a scale factor) 
the correlations between the NPCs and the original variables. The derivation of the (N)PCs 
presented above will have a nice geometric justification here since they are the best fit in 
subspaces generated by the columns of the (transformed) data matrix $\mathcal{X}$. This analogy 
provides complementary interpretations of the graphical representations shown above. 


Assume, as in Chapter 8, that we want to obtain representations of the individuals (the 
rows of $\mathcal{X}$) and of the variables (the columns of $\mathcal{X}$) in spaces of smaller dimension. To keep 
the representations simple, some prior transformations are performed. Since the origin has 
no particular statistical meaning in the space of individuals, we will first shift the origin to 
the center of gravity, $\bar{x}$, of the point cloud. This is the same as analyzing the centered data 
matrix $\mathcal{X}_C = H\mathcal{X}$. Now all of the variables have zero means, thus the technique used in 
Chapter 8 can be applied to the matrix $\mathcal{X}_C$. Note that the spectral decomposition of $\mathcal{X}_C^\top\mathcal{X}_C$ 
is related to that of $S_X$, namely 

$$\mathcal{X}_C^\top\mathcal{X}_C = \mathcal{X}^\top H^\top H \mathcal{X} = nS_X = nGLG^\top. \tag{9.28}$$

The factorial variables are obtained by projecting $\mathcal{X}_C$ on $G$, 

$$\mathcal{Y} = \mathcal{X}_C G = (y_1,\dots,y_p). \tag{9.29}$$

These are the same principal components obtained above, see formula (9.10). (Note that 
the $y$'s here correspond to the $z$'s in Section 8.2.) Since $H\mathcal{X}_C = \mathcal{X}_C$, it immediately follows 
that 

$$\bar{y} = 0, \tag{9.30}$$
$$S_Y = G^\top S_X G = L = \operatorname{diag}(\ell_1,\dots,\ell_p). \tag{9.31}$$

The scatterplot of the individuals on the factorial axes is thus centered around the origin 
and is more spread out in the first direction (first PC has variance $\ell_1$) than in the second 
direction (second PC has variance $\ell_2$). 


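The representation of the variables can be obtained using the Duality Relations (8.11) 
and (8.12). The projections of the columns of $\mathcal{X}_C$ onto the eigenvectors $v_k$ of $\mathcal{X}_C\mathcal{X}_C^\top$ are 

$$\mathcal{X}_C^\top v_k = \frac{1}{\sqrt{n\ell_k}}\,\mathcal{X}_C^\top\mathcal{X}_C\, g_k = \sqrt{n\ell_k}\; g_k. \tag{9.32}$$

Thus the projections of the variables on the first $p$ axes are the columns of the matrix 

$$\mathcal{X}_C^\top V = \sqrt{n}\,GL^{1/2}. \tag{9.33}$$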
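
Relations (9.32) and (9.33) can be checked numerically. The sketch below uses a small simulated data matrix (hypothetical, for illustration only) and builds the eigenvectors $v_k$ of $\mathcal{X}_C\mathcal{X}_C^\top$ from the eigenvectors $g_k$ of $S_X$ through $v_k = \mathcal{X}_C g_k / \sqrt{n\ell_k}$, which is the duality relation assumed here.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 3

# Small simulated data matrix (illustrative only).
X = rng.standard_normal((n, p)) @ np.array([[1.5, 0.4, 0.0],
                                            [0.0, 1.0, 0.3],
                                            [0.0, 0.0, 0.6]])
Xc = X - X.mean(axis=0)                 # centered data matrix X_C = H X
S = Xc.T @ Xc / n

ell, G = np.linalg.eigh(S)
order = np.argsort(ell)[::-1]
ell, G = ell[order], G[:, order]

# Column k of V is v_k = X_C g_k / sqrt(n * ell_k), a unit eigenvector of X_C X_C'.
V = Xc @ G / np.sqrt(n * ell)

projections = Xc.T @ V                  # left-hand side of (9.33)
expected = np.sqrt(n) * G * np.sqrt(ell)   # sqrt(n) G L^{1/2}

print(np.round(projections, 4))
print(np.round(expected, 4))
```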
Considering the geometric representation, there is a nice statistical interpretation of the 
angle between two columns of $\mathcal{X}_C$. Given that 

$$x_{C[j]}^\top x_{C[k]} = n s_{X_jX_k}, \tag{9.34}$$
$$\|x_{C[j]}\|^2 = n s_{X_jX_j}, \tag{9.35}$$

where $x_{C[j]}$ and $x_{C[k]}$ denote the $j$-th and $k$-th column of $\mathcal{X}_C$, it holds that in the full space 
of the variables, if $\theta_{jk}$ is the angle between two variables, $x_{C[j]}$ and $x_{C[k]}$, then 

$$\cos\theta_{jk} = \frac{x_{C[j]}^\top x_{C[k]}}{\|x_{C[j]}\|\,\|x_{C[k]}\|} = r_{X_jX_k} \tag{9.36}$$

(Example 2.11 shows the general connection that exists between the angle and correlation of 
two variables). As a result, the relative positions of the variables in the scatterplot of the first 
columns of $\mathcal{X}_C^\top V$ may be interpreted in terms of their correlations; the plot provides a picture 
of the correlation structure of the original data set. Clearly, one should take into account 
the percentage of variance explained by the chosen axes when evaluating the correlation. 


The NPCs can also be viewed as a factorial method for reducing the dimension. The variables 
are again standardized so that each one has mean zero and unit variance and is independent 
of the scale of the variables. The factorial analysis of $\mathcal{X}_S$ provides the NPCs. The spectral 
decomposition of $\mathcal{X}_S^\top\mathcal{X}_S$ is related to that of $R$, namely 

$$\mathcal{X}_S^\top\mathcal{X}_S = D^{-1/2}\mathcal{X}^\top H \mathcal{X} D^{-1/2} = nR = nG_R L_R G_R^\top.$$

The NPCs $Z_j$, given by (9.21), may be viewed as the projections of the rows of $\mathcal{X}_S$ onto $G_R$. 


The representation of the variables is again given by the columns of 
$$\mathcal{X}_S^\top V_R = \sqrt{n}\,G_R L_R^{1/2}. \tag{9.37}$$


Comparing (9.37) and (9.25) we see that the projections of the variables in the factorial 
analysis provide the correlation between the NPCs $Z_j$ and the original variables $x_{[i]}$ (up to 
the factor $\sqrt{n}$ which could be the scale of the axes). 


This implies that a deeper interpretation of the representation of the individuals can be 
obtained by looking simultaneously at the graphs plotting the variables. Note that 

$$x_{S[j]}^\top x_{S[k]} = n r_{X_jX_k}, \tag{9.38}$$
$$\|x_{S[j]}\|^2 = n, \tag{9.39}$$

where $x_{S[j]}$ and $x_{S[k]}$ denote the $j$-th and $k$-th column of $\mathcal{X}_S$. Hence, in the full space, all 
the standardized variables (columns of $\mathcal{X}_S$) are contained within the "sphere" in $\mathbb{R}^n$, which 
is centered at the origin and has radius $\sqrt{n}$ (the scale of the graph). As in (9.36), given the 
angle $\theta_{jk}$ between two columns $x_{S[j]}$ and $x_{S[k]}$, it holds that 

$$\cos\theta_{jk} = r_{X_jX_k}. \tag{9.40}$$




Therefore, when looking at the representation of the variables in the spaces of reduced 
dimension (for instance the first two factors), we have a picture of the correlation structure 
between the original $X_i$'s in terms of their angles. Of course, the quality of the representation 
in those subspaces has to be taken into account, which is presented in the next section. 


Quality of the representations 


As said before, an overall measure of the quality of the representation is given by 

$$\psi = \frac{\ell_1 + \ell_2 + \dots + \ell_q}{\sum_{j=1}^p \ell_j}.$$

In practice, $q$ is chosen to be equal to 1, 2 or 3. Suppose for instance that $\psi = 0.93$ for $q = 2$. 
This means that the graphical representation in two dimensions captures 93% of the total 
variance. In other words, there is minimal dispersion in a third direction (no more than 7%). 


It can be useful to check if each individual is well represented by the PCs. Clearly, the 
proximity of two individuals on the projected space may not necessarily coincide with the 
proximity in the full original space $\mathbb{R}^p$, which may lead to erroneous interpretations of the 
graphs. In this respect, it is worth computing the angle $\vartheta_{ik}$ between the representation of 
an individual $i$ and the $k$-th PC or NPC axis. This can be done using (2.40), i.e., 

$$\cos\vartheta_{ik} = \frac{y_i^\top e_k}{\|y_i\|\,\|e_k\|} = \frac{y_{ik}}{\|x_{Ci}\|}$$

for the PCs or analogously 

$$\cos\zeta_{ik} = \frac{z_i^\top e_k}{\|z_i\|\,\|e_k\|} = \frac{z_{ik}}{\|x_{Si}\|}$$

for the NPCs, where $e_k$ denotes the $k$-th unit vector $e_k = (0,\dots,1,\dots,0)^\top$. An individual 
$i$ will be represented on the $k$-th PC axis if its corresponding angle is small, i.e., if $\cos^2\vartheta_{ik}$ 
for $k = 1,\dots,p$ is close to one. Note that for each individual $i$, 

$$\sum_{k=1}^p \cos^2\vartheta_{ik} = \frac{y_i^\top y_i}{x_{Ci}^\top x_{Ci}} = \frac{x_{Ci}^\top G G^\top x_{Ci}}{x_{Ci}^\top x_{Ci}} = 1.$$

The values $\cos^2\vartheta_{ik}$ are sometimes called the relative contributions of the $k$-th axis to the 
representation of the $i$-th individual, e.g., if $\cos^2\vartheta_{i1} + \cos^2\vartheta_{i2}$ is large (near one), we know 
that the individual $i$ is well represented on the plane of the first two principal axes since its 
corresponding angle with the plane is close to zero. 
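
The relative contributions can be computed for all individuals at once. The following sketch uses simulated data (an illustrative stand-in for a real data matrix) and evaluates $\cos^2\vartheta_{ik}$ for every individual $i$ and axis $k$, together with the quality of the representation on the plane of the first two axes.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 4

# Simulated data matrix (illustrative only).
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
Xc = X - X.mean(axis=0)                  # rows are the centered individuals x_Ci
S = Xc.T @ Xc / n

ell, G = np.linalg.eigh(S)
order = np.argsort(ell)[::-1]
ell, G = ell[order], G[:, order]

Y = Xc @ G                               # PC scores y_ik

# cos^2 of the angle between individual i and the k-th PC axis.
squared_norms = (Xc ** 2).sum(axis=1, keepdims=True)   # ||x_Ci||^2
cos2 = Y ** 2 / squared_norms

# Quality of the representation of each individual on the first two axes.
plane_quality = cos2[:, :2].sum(axis=1)

print(np.round(cos2[:5], 3))
print(np.round(plane_quality[:5], 3))
```

Individuals with `plane_quality` close to one lie almost in the plane of the first two principal axes; for the others, proximity in the two-dimensional plot should be interpreted with care.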


We already know that the quality of the representation of the variables can be evaluated by 
the percentage of $X_i$'s variance that is explained by a PC, which is given by $r^2_{X_iY_j}$ or $r^2_{X_iZ_j}$ 
according to (9.16) and (9.27) respectively. 
Figure 9.7. Representation of the individuals (plot title: French food data; horizontal 
axis: first factor - families; vertical axis: second factor - families). MVAnpcafood.xpl 


EXAMPLE 9.6 Let us return to the French food expenditure example, see Appendix B.6. 
This yields a two-dimensional representation of the individuals as shown in Figure 9.7. 


Calculating the matrix $G_R$ we have 

$$G_R = \begin{pmatrix}
-0.240 & 0.622 & -0.011 & -0.544 & 0.036 & 0.508 \\
-0.466 & 0.098 & -0.062 & -0.023 & -0.809 & -0.301 \\
-0.446 & -0.205 & 0.145 & 0.548 & -0.067 & 0.625 \\
-0.462 & -0.141 & 0.207 & -0.053 & 0.411 & -0.093 \\
-0.438 & -0.197 & 0.356 & -0.324 & 0.224 & -0.350 \\
-0.281 & 0.523 & -0.444 & 0.450 & 0.341 & -0.332 \\
0.206 & 0.479 & 0.780 & 0.306 & -0.069 & -0.138
\end{pmatrix},$$

which gives the weights of the variables (milk, vegetables, etc.). The eigenvalues $\ell_j$ and the 
proportions of explained variance are given in Table 9.7. 


The interpretation of the principal components is best understood when looking at the
correlations between the original $X_i$'s and the PCs. Since the first two PCs explain 88.1% of
the variance, we limit ourselves to the first two PCs. The results are shown in Table 9.8.
The two-dimensional graphical representation of the variables in Figure 9.8 is based on the
first two columns of Table 9.8.


Figure 9.8. Representation of the variables. MVAnpcafood.xpl
(French food data; horizontal axis: first factor - goods, vertical axis: second factor - goods.)


eigenvalues    proportion of variance    cumulated proportion (%)

   4.333              0.6190                     61.9
   1.830              0.2620                     88.1
   0.631              0.0900                     97.1
   0.128              0.0180                     98.9
   0.058              0.0080                     99.7
   0.019              0.0030                     99.9
   0.001              0.0001                    100.0


Table 9.7. Eigenvalues and explained variance


                    r_{X_i Z_1}    r_{X_i Z_2}    r^2_{X_i Z_1} + r^2_{X_i Z_2}
X_1: bread            -0.499          0.842          0.957
X_2: vegetables       -0.970          0.133          0.958
X_3: fruits           -0.929         -0.278          0.941
X_4: meat             -0.962         -0.191          0.962
X_5: poultry          -0.911         -0.266          0.901
X_6: milk             -0.584          0.707          0.841
X_7: wine              0.428          0.648          0.604


Table 9.8. Correlations with PCs


The plots are the projections of the variables into $\mathbb{R}^2$. Since the quality of the
representation is good for all the variables (except maybe $X_7$), their relative angles give a
picture of their original correlation: wine is negatively correlated with the vegetables, fruits,
meat and poultry groups ($\theta > 90^{\circ}$), whereas taken individually the variables in this
latter group are highly positively correlated with each other ($\theta \approx 0$). Bread and milk
are positively correlated but poorly correlated with meat, fruits and poultry ($\theta \approx 90^{\circ}$).


Now the representation of the individuals in Figure 9.7 can be interpreted better. From
Figure 9.8 and Table 9.8 we can see that the first factor $Z_1$ is a vegetable-meat-poultry-fruit
factor (with a negative sign), whereas the second factor is a milk-bread-wine factor
(with a positive sign). Note that this corresponds to the most important weights in the first
columns of $\mathcal{G}_{\mathcal{R}}$. In Figure 9.7 lines were drawn to connect families of the
same size and families of the same professional types. A grid can clearly be seen (with a slight
deformation by the manager families) that shows the families with higher expenditures (higher
number of children) on the left.


Considering both figures together explains what types of expenditures are responsible for sim- 
ilarities in food expenditures. Bread, milk and wine expenditures are similar for manual 
workers and employees. Families of managers are characterized by higher expenditures on 
vegetables, fruits, meat and poultry. Very often when analyzing NPCs (and PCs), it is il- 
luminating to use such a device to introduce qualitative aspects of individuals in order to 
enrich the interpretations of the graphs. 


Summary


↪ NPCs are PCs applied to the standardized (normalized) data matrix $\mathcal{X}_S$.


↪ The graphical representation of NPCs provides a similar type of picture
as that of PCs, the difference being in the relative position of individuals,
i.e., each variable in NPCs has the same weight (in PCs, the variable with
the largest variance has the largest weight).


↪ The quality of the representation is evaluated by
$\psi = \left(\sum_{j=1}^{p} \ell_j\right)^{-1} (\ell_1 + \ell_2 + \ldots + \ell_q)$.


↪ The quality of the representation of a variable can be evaluated by the
percentage of $X_i$'s variance that is explained by a PC, i.e., $r^2_{X_i Y_j}$.


9.7 Common Principal Components 


In many applications a statistical analysis is simultaneously done for groups of data. In this
section a technique is presented that allows us to analyze group elements that have common
PCs. From a statistical point of view, estimating PCs simultaneously in different groups
will result in a joint dimension reducing transformation. This multi-group PCA, the so-called
common principal components analysis (CPCA), yields the joint eigenstructure across
groups.


Compared with traditional PCA, the additional basic assumption of CPCA is that the space
spanned by the eigenvectors is identical across several groups, whereas variances associated
with the components are allowed to vary.


More formally, the hypothesis of common principal components can be stated in the following
way (Flury, 1988):

$$H_{CPC}: \Sigma_i = \Gamma \Lambda_i \Gamma^{\top}, \quad i = 1, \ldots, k,$$


where $\Sigma_i$ is a positive definite $p \times p$ population covariance matrix for every $i$,
$\Gamma = (\gamma_1, \ldots, \gamma_p)$ is an orthogonal $p \times p$ transformation matrix and
$\Lambda_i = \mathrm{diag}(\lambda_{i1}, \ldots, \lambda_{ip})$ is the matrix of
eigenvalues. Moreover, assume that all $\lambda_i$ are distinct.
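
To make the hypothesis concrete, the sketch below builds three covariance matrices that share
one set of eigenvectors but have different eigenvalues, and then checks that a single rotation
diagonalizes all of them. The rotation angle, the eigenvalues and the small matrix helpers are
illustrative assumptions; this is not the estimation algorithm of Flury and Gautschi (1986).

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def diag(values):
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


# Hypothetical common eigenvectors: a rotation by 30 degrees (p = 2).
angle = math.radians(30.0)
Gamma = [[math.cos(angle), -math.sin(angle)],
         [math.sin(angle), math.cos(angle)]]

# Hypothetical group-specific eigenvalues (k = 3 groups).
lambdas = [[4.0, 1.0], [9.0, 0.5], [2.0, 1.5]]

# Sigma_i = Gamma Lambda_i Gamma^T for every group i.
sigmas = [matmul(matmul(Gamma, diag(lam)), transpose(Gamma)) for lam in lambdas]

# Rotating each Sigma_i back with the same Gamma gives a diagonal matrix
# whose diagonal holds the eigenvalues of group i.
rotated = [matmul(matmul(transpose(Gamma), s), Gamma) for s in sigmas]

for lam, rot in zip(lambdas, rotated):
    print(lam, [[round(v, 10) for v in row] for row in rot])
```

The covariance matrices themselves differ from group to group; only the directions of the
principal axes are shared, which is exactly what the hypothesis states.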


Let $S$ be the (unbiased) sample covariance matrix of an underlying $p$-variate normal
distribution $N_p(\mu, \Sigma)$ with sample size $n$. Then the distribution of $(n-1)S$ has
$n-1$ degrees of freedom and is known as the Wishart distribution (Muirhead, 1982, p. 86):


$$(n-1)S \sim W_p(\Sigma, n - 1).$$


The density is given in (5.16). Hence, for a given Wishart matrix $S_i$ with sample size $n_i$, the
likelihood function can be written as


$$L(\Sigma_1, \ldots, \Sigma_k) = C \prod_{i=1}^{k}
  \exp\left\{\mathrm{tr}\left(-\frac{1}{2}(n_i - 1)\Sigma_i^{-1} S_i\right)\right\}
  (\det \Sigma_i)^{-\frac{1}{2}(n_i - 1)} \qquad (9.41)$$


where $C$ is a constant independent of the parameters $\Sigma_i$. Maximizing the likelihood is
equivalent to minimizing the function


$$g(\Sigma_1, \ldots, \Sigma_k) = \sum_{i=1}^{k} (n_i - 1)
  \left\{\ln(\det \Sigma_i) + \mathrm{tr}(\Sigma_i^{-1} S_i)\right\}.$$


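Assuming that $H_{CPC}$ holds, i.e., in replacing $\Sigma_i$ by $\Gamma \Lambda_i \Gamma^{\top}$,
after some manipulations one obtains


$$g(\Gamma, \Lambda_1, \ldots, \Lambda_k) = \sum_{i=1}^{k} (n_i - 1) \sum_{j=1}^{p}
  \left(\ln \lambda_{ij} + \frac{\gamma_j^{\top} S_i \gamma_j}{\lambda_{ij}}\right).$$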


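As we know from Section 2.2, the vectors $\gamma_j$ in $\Gamma$ have to be orthogonal. Orthogonality
of the vectors $\gamma_j$ is achieved using the Lagrange method, i.e., we impose the $p$ constraints
$\gamma_j^{\top}\gamma_j = 1$ using the Lagrange multipliers $\mu_j$, and the remaining $p(p-1)/2$
constraints $\gamma_h^{\top}\gamma_j = 0$ for $h \neq j$ using the multipliers $2\mu_{hj}$
(Flury, 1988). This yields


$$g^{*}(\Gamma, \Lambda_1, \ldots, \Lambda_k) = g(\cdot)
  - \sum_{j=1}^{p} \mu_j (\gamma_j^{\top}\gamma_j - 1)
  - 2 \sum_{h=1}^{p} \sum_{j=h+1}^{p} \mu_{hj} \gamma_h^{\top}\gamma_j.$$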


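Taking partial derivatives with respect to all $\lambda_{im}$ and $\gamma_m$, it can be shown that
the solution of the CPC model is given by the generalized system of characteristic equations


$$\gamma_m^{\top} \left\{ \sum_{i=1}^{k} (n_i - 1)
  \frac{\lambda_{im} - \lambda_{ij}}{\lambda_{im}\lambda_{ij}} S_i \right\} \gamma_j = 0,
  \quad m, j = 1, \ldots, p, \quad m \neq j.$$


This system can be solved using


$$\lambda_{im} = \gamma_m^{\top} S_i \gamma_m, \quad i = 1, \ldots, k, \quad m = 1, \ldots, p$$


under the constraints


$$\gamma_m^{\top}\gamma_j = \begin{cases} 0 & m \neq j \\ 1 & m = j \end{cases}.$$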


Flury (1988) proves existence and uniqueness of the maximum of the likelihood function, 
and Flury and Gautschi (1986) provide a numerical algorithm. 


EXAMPLE 9.7 As an example we provide the data sets XFGvolsurf01, XFGvolsurf02 and
XFGvolsurf03 that have been used in Fengler, Härdle and Villa (2001) to estimate common
principal components for the implied volatility surfaces of the DAX 1999. The data has been
generated by smoothing an implied volatility surface day by day. Next, the estimated grid
points have been grouped into maturities of $\tau = 1$, $\tau = 2$ and $\tau = 3$ months and
transformed into a vector of time series of the “smile”, i.e., each element of the vector belongs
to a distinct moneyness ranging from 0.85 to 1.10.


Figure 9.9 shows the first three eigenvectors in a parallel coordinate plot. The basic structure 
of the first three eigenvectors is not altered. We find a shift, a slope and a twist structure. 
This structure is common to all maturity groups, i.e., when exploiting PCA as a dimension
reducing tool, the same transformation applies to each group! However, by comparing the
size of eigenvalues among groups we find that variability is decreasing across groups as we 
move from the short term contracts to long term contracts. 


Figure 9.9. Factor loadings of the first (thick), the second (medium), and
the third (thin) PC. MVAcpcaiv.xpl
(Parallel coordinate plot “PCP for CPCA, 3 eigenvectors”; horizontal axis: moneyness.)


Before drawing conclusions we should convince ourselves that the CPC model is truly a good
description of the data. This can be done by using a likelihood ratio test. The likelihood
ratio statistic for comparing a restricted (the CPC) model against the unrestricted model
(the model where all covariances are treated separately) is given by


$$T_{(n_1, n_2, \ldots, n_k)} = -2 \ln
  \frac{L(\hat\Sigma_1, \ldots, \hat\Sigma_k)}{L(S_1, \ldots, S_k)},$$


where $\hat\Sigma_i$ denotes the estimate of $\Sigma_i$ under the CPC model. Inserting the
likelihood function, we find that this is equivalent to


$$T_{(n_1, n_2, \ldots, n_k)} = \sum_{i=1}^{k} (n_i - 1)
  \ln \frac{\det(\hat\Sigma_i)}{\det(S_i)},$$


which has a $\chi^2$ distribution as $\min(n_i)$ tends to infinity with


$$k\left\{\frac{1}{2}p(p-1) + p\right\} - \left\{\frac{1}{2}p(p-1) + kp\right\}
  = \frac{1}{2}(k-1)p(p-1)$$


degrees of freedom. This test is included in the quantlet MVAcpcaiv.xpl.


The calculations yield $T_{(n_1, n_2, \ldots, n_k)} = 31.836$, which corresponds to the p-value
$p = 0.37512$ for the $\chi^2(30)$ distribution. Hence we cannot reject the CPC model against the
unrestricted model, where PCA is applied to each maturity separately.
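
The degrees of freedom and the p-value quoted above can be checked with a few lines of plain
Python. The sketch assumes $k = 3$ maturity groups and $p = 6$ moneyness points; the value of
$p$ is not stated explicitly in the text and is inferred here from the 30 degrees of freedom.
The helper `chi2_sf_even_df` is a made-up name; it uses the closed form of the chi-square
upper tail that holds for an even number of degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper tail P(X > x) of a chi-square variable with an even number of df.

    For df = 2m the tail equals exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


k, p = 3, 6                        # p = 6 is an assumption, see the lead-in
df = (k - 1) * p * (p - 1) // 2    # (1/2)(k-1)p(p-1)
p_value = chi2_sf_even_df(31.836, df)

print(df, round(p_value, 4))
```

A p-value of this size is far above the usual significance levels, which is why the CPC model
is not rejected.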


Using the methods in Section 9.3, we can estimate the amount of variability, $\zeta_l$, explained
by the first $l$ principal components (only a few factors, three at the most, are needed to capture
a large amount of the total variability present in the data). Since the model now captures
the variability in both the strike and maturity dimensions, this is a suitable starting point for
a simplified VaR calculation for delta-gamma neutral option portfolios using Monte Carlo
methods, and is hence a valuable insight in risk management.


9.8 Boston Housing 


A set of transformations was defined in Chapter 1 for the Boston Housing data set that
resulted in “regular” marginal distributions. The usefulness of principal component analysis
with respect to such high-dimensional data sets will now be shown. The variable $X_4$ is
dropped because it is a discrete 0-1 variable. It will be used later, however, in the graphical
representations. The scale difference of the remaining 13 variables motivates a NPCA based
on the correlation matrix.


The eigenvalues and the percentage of explained variance are given in Table 9.10. 


The first principal component explains 56% of the total variance and the first three compo- 
nents together explain more than 75%. These results imply that it is sufficient to look at 2, 
maximum 3, principal components. 
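
The percentages quoted here follow from the eigenvalues alone. As a small sketch in plain
Python, using the eigenvalues listed in Table 9.10 (the variable names are illustrative), the
proportions and cumulated proportions can be recomputed as follows:

```python
# Eigenvalues of the correlation matrix as listed in Table 9.10.
eigenvalues = [7.2852, 1.3517, 1.1266, 0.7802, 0.6359, 0.5290, 0.3397,
               0.2628, 0.1936, 0.1547, 0.1405, 0.1100, 0.0900]

# For a correlation matrix the eigenvalues sum to the number of variables.
total = sum(eigenvalues)
proportions = [ev / total for ev in eigenvalues]

cumulated = []
running = 0.0
for share in proportions:
    running += share
    cumulated.append(running)

for ev, share, cum in zip(eigenvalues, proportions, cumulated):
    print(f"{ev:8.4f} {share:8.4f} {cum:8.4f}")
```

The first line of the output gives the share of the first PC and the third line the cumulated
share of the first three PCs, which can be compared with the figures in the text.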


Table 9.11 provides the correlations between the first three PC’s and the original variables. 
These can be seen in Figure 9.10. 


The correlations with the first PC show a very clear pattern. The variables $X_2$, $X_6$, $X_8$,
$X_{12}$, and $X_{14}$ are strongly positively correlated with the first PC, whereas the remaining
variables are highly negatively correlated. The minimal correlation in the absolute value is 0.5.
The first PC axis could be interpreted as a quality of life and house indicator. The second axis,
given the polarities of $X_{11}$ and $X_{13}$ and of $X_6$ and $X_{14}$, can be interpreted as a
social factor explaining only 10% of the total variance. The third axis is dominated by a polarity
between $X_2$ and $X_{12}$.


eigenvalue    percentages    cumulated percentages

  7.2852        0.5604          0.5604
  1.3517        0.1040          0.6644
  1.1266        0.0867          0.7510
  0.7802        0.0600          0.8111
  0.6359        0.0489          0.8600
  0.5290        0.0407          0.9007
  0.3397        0.0261          0.9268
  0.2628        0.0202          0.9470
  0.1936        0.0149          0.9619
  0.1547        0.0119          0.9738
  0.1405        0.0108          0.9846
  0.1100        0.0085          0.9931
  0.0900        0.0069          1.0000


Table 9.10. Eigenvalues and percentage of explained variance for Boston
Housing data. MVAnpcahous.xpl


            PC_1       PC_2       PC_3
X_1       -0.9076     0.2247     0.1457
X_2        0.6399    -0.0292     0.5058
X_3       -0.8580     0.0409    -0.1845
X_5       -0.8737     0.2391    -0.1780
X_6        0.5104     0.7037     0.0869
X_7       -0.7999     0.1556    -0.2949
X_8        0.8259    -0.2904     0.2982
X_9       -0.7531     0.2857     0.3804
X_10      -0.8114     0.1645     0.3672
X_11      -0.5674    -0.2667     0.1498
X_12       0.4906    -0.1041    -0.5170
X_13      -0.7996    -0.4253    -0.0251
X_14       0.7366     0.5160    -0.1747


Table 9.11. Correlations of the first three PC’s with the original variables.
MVAnpcahous.xpl


Figure 9.10. NPCA for the Boston housing data, correlations of first three
PCs with the original variables. MVAnpcahous1.xpl
(Two panels titled “Boston housing”; vertical axes: second PC and third PC.)


The set of individuals from the first two PCs can be graphically interpreted if the plots are
color coded with respect to some particular variable of interest. Figure 9.11 color codes
$X_{14} >$ median as red points. Clearly the first and second PCs are related to house value.
The situation is less clear in Figure 9.12 where the color code corresponds to $X_4$, the Charles
River indicator, i.e., houses near the river are colored red.


9.9 More Examples 


EXAMPLE 9.8 Let us now apply the PCA to the standardized bank data set (Table B.2). 
Figure 9.13 shows some PC plots of the bank data set. The genuine and counterfeit bank 
notes are marked by “o” and “+” respectively. 


Figure 9.11. NPC analysis for the Boston housing data, scatterplot of
the first two PCs. More expensive houses are marked with red color.
MVAnpcahous.xpl


The vector of eigenvalues of $\mathcal{R}$ is

$$\ell = (2.946, 1.278, 0.869, 0.450, 0.269, 0.189)^{\top}.$$

The eigenvectors $g_j$ are given by the columns of the matrix


$$\mathcal{G} = \begin{pmatrix}
-0.007 & -0.815 &  0.018 &  0.575 &  0.059 &  0.031 \\
 0.468 & -0.342 & -0.103 & -0.395 & -0.639 & -0.298 \\
 0.487 & -0.252 & -0.123 & -0.430 &  0.614 &  0.349 \\
 0.407 &  0.266 & -0.584 &  0.404 &  0.215 & -0.462 \\
 0.368 &  0.091 &  0.788 &  0.110 &  0.220 & -0.419 \\
-0.493 & -0.274 & -0.114 & -0.392 &  0.340 & -0.632
\end{pmatrix}.$$


Each original variable has the same weight in the analysis and the results are independent
of the scale of each variable.


Figure 9.12. NPC analysis for the Boston housing data, scatterplot of the
first two PCs. Houses close to the Charles River are indicated with red
squares. MVAnpcahous.xpl


  l_j      proportion of variances    cumulated proportion (%)

 2.946            0.491                      49.1
 1.278            0.213                      70.4
 0.869            0.145                      84.9
 0.450            0.075                      92.4
 0.269            0.045                      96.9
 0.189            0.032                     100.0


Table 9.12. Eigenvalues and proportions of explained variance


The proportions of explained variance are given in Table 9.12. It can be concluded that the
representation in two dimensions should be sufficient. The correlations leading to Figure 9.14
are given in Table 9.13. The picture is different from the one obtained in Section 9.3 (see
Table 9.4). Here, the first factor is mainly a left-right vs. diagonal factor and the second one is
a length factor (with negative weight). Take another look at Figure 9.13, where the individual
bank notes are displayed. In the upper left graph it can be seen that the genuine bank notes
are for the most part in the south-eastern portion of the graph featuring a larger diagonal,
smaller height ($Z_1 < 0$) and also a larger length ($Z_2 < 0$). Note also that Figure 9.14 gives
an idea of the correlation structure of the original data matrix.


Figure 9.13. Principal components of the standardized bank data. MVAnpcabank.xpl
(Panels include: first vs. second PC; second vs. third PC.)


                       r_{X_i Z_1}    r_{X_i Z_2}    r^2_{X_i Z_1} + r^2_{X_i Z_2}
X_1: length              -0.012         -0.922          0.85
X_2: left height          0.803         -0.387          0.79
X_3: right height         0.835         -0.285          0.78
X_4: lower                0.698          0.301          0.58
X_5: upper                0.631          0.104          0.41
X_6: diagonal            -0.847         -0.310          0.81


Table 9.13. Correlations with PCs


Figure 9.14. The correlations of the original variable with the PCs.
MVAnpcabank1.xpl
(Swiss bank notes; horizontal axis: first PC, vertical axis: second PC.)
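
The entries of Table 9.13 can be reproduced, up to rounding, from the eigenvalues and
eigenvectors of Example 9.8. The sketch below assumes the standard relation for normalized
PCs, $r_{X_i Z_j} = \sqrt{\ell_j}\, g_{ij}$; equation (9.27) is not reproduced in this section,
so reading it as this relation is an assumption, and the variable names are illustrative.

```python
import math

# First two eigenvalues of R and the first two columns of G from Example 9.8.
ell = [2.946, 1.278]
g = [  # rows: X_1, ..., X_6; columns: first and second eigenvector
    [-0.007, -0.815],
    [0.468, -0.342],
    [0.487, -0.252],
    [0.407, 0.266],
    [0.368, 0.091],
    [-0.493, -0.274],
]
names = ["length", "left height", "right height", "lower", "upper", "diagonal"]

# Correlation of variable i with NPC j: sqrt(l_j) * g_ij (assumed relation).
corr = [[g_ij * math.sqrt(ell[j]) for j, g_ij in enumerate(row)] for row in g]

# Share of each variable's variance explained by the first two NPCs.
explained = [r1 ** 2 + r2 ** 2 for r1, r2 in corr]

for name, (r1, r2), share in zip(names, corr, explained):
    print(f"{name:13s} {r1:7.3f} {r2:7.3f} {share:6.2f}")
```

Small differences in the last digit relative to the table are to be expected, because the
inputs used here are already rounded to three decimals.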


EXAMPLE 9.9 Consider the data of 79 U.S. companies given in Table B.5. The data is
first standardized by subtracting the mean and dividing by the standard deviation. Note that
the data set contains six variables: assets ($X_1$), sales ($X_2$), market value ($X_3$), profits
($X_4$), cash flow ($X_5$), number of employees ($X_6$).


Calculating the corresponding vector of eigenvalues gives


$$\ell = (5.039, 0.517, 0.359, 0.050, 0.029, 0.007)^{\top}$$

and the matrix of eigenvectors is


$$\mathcal{G} = \begin{pmatrix}
0.340 & -0.849 & -0.339 &  0.205 &  0.077 & -0.006 \\
0.423 & -0.170 &  0.379 & -0.783 & -0.006 & -0.186 \\
0.434 &  0.190 & -0.192 &  0.071 & -0.844 &  0.149 \\
0.420 &  0.364 & -0.324 &  0.156 &  0.261 & -0.703 \\
0.428 &  0.285 & -0.267 & -0.121 &  0.452 &  0.667 \\
0.397 &  0.010 &  0.726 &  0.548 &  0.098 &  0.065
\end{pmatrix}.$$


Using this information the graphical representations of the first two principal components
are given in Figure 9.15. Distinct symbols mark the following sectors:


Hi Tech and Communication
Energy
Finance
Manufacturing
Retail
all other sectors.


The two outliers in the right-hand side of the graph are IBM and General Electric (GE),
which differ from the other companies with their high market values. As can be seen in the
first column of $\mathcal{G}$, market value has the largest weight in the first PC, adding to the
isolation of these two companies. If IBM and GE were to be excluded from the data set, a completely
different picture would emerge, as shown in Figure 9.16. In this case the vector of eigenvalues
becomes

$$\ell = (3.191, 1.535, 0.791, 0.292, 0.149, 0.041)^{\top},$$


and the corresponding matrix of eigenvectors is


$$\mathcal{G} = \begin{pmatrix}
0.263 & -0.408 & -0.800 & -0.067 &  0.333 &  0.099 \\
0.438 & -0.407 &  0.162 & -0.509 & -0.441 & -0.403 \\
0.500 & -0.003 & -0.035 &  0.801 & -0.264 & -0.190 \\
0.331 &  0.623 & -0.080 & -0.192 &  0.426 & -0.526 \\
0.443 &  0.450 & -0.123 & -0.238 & -0.335 &  0.646 \\
0.427 & -0.277 &  0.558 &  0.021 &  0.575 &  0.313
\end{pmatrix}.$$


The percentage of variation explained by each component is given in Table 9.14. The first 
two components explain almost 79% of the variance. The interpretation of the factors (the 
axes of Figure 9.16) is given in the table of correlations (Table 9.15). The first two columns 
of this table are plotted in Figure 9.17. 


From Figure 9.17 (and Table 9.15) it appears that the first factor is a “size effect”: it is
positively correlated with all the variables describing the size of the activity of the companies.
It is also a measure of the economic strength of the firms. The second factor describes the
“shape” of the companies (“profit-cash flow” vs. “assets-sales” factor), which is more difficult
to interpret from an economic point of view.


Figure 9.15. Principal components of the U.S. company data.
MVAnpcausco.xpl
(Panels: first vs. second PC; eigenvalues of S plotted against their index.)


  l_j      proportion of variance    cumulated proportion
 3.191            0.532                    0.532
 1.535            0.256                    0.788
 0.791            0.132                    0.920
 0.292            0.049                    0.968
 0.149            0.025                    0.993
 0.041            0.007                    1.000


Table 9.14. Eigenvalues and proportions of explained variance.


Figure 9.16. Principal components of the U.S. company data (without
IBM and General Electric). MVAnpcausco2.xpl
(Panels: first vs. second PC; eigenvalues plotted against their index.)


                      r_{X_i Z_1}    r_{X_i Z_2}    r^2_{X_i Z_1} + r^2_{X_i Z_2}
X_1: assets              0.47          -0.510          0.48
X_2: sales               0.78          -0.500          0.87
X_3: market value        0.89          -0.003          0.80
X_4: profits             0.59           0.770          0.95
X_5: cash flow           0.79           0.560          0.94
X_6: employees           0.76          -0.340          0.70


Table 9.15. Correlations with PCs.


Figure 9.17. The correlation of the original variables with the PCs.
MVAnpcausco2i.xpl
(U.S. company data; horizontal axis: first PC, vertical axis: second PC.)


EXAMPLE 9.10 Volle (1985) analyzes data on 28 individuals (Table B.14). For each individual,
the time spent (in hours) on 10 different activities has been recorded over 100 days, as
well as informative statistics such as the individual’s sex, country of residence, professional
activity and matrimonial status. The results of a NPCA are given below.


The eigenvalues of the correlation matrix are given in Table 9.16. Note that the last eigenvalue
is exactly zero since the correlation matrix is singular (the sum of all the variables is
always equal to 2400 = 24 × 100). The results of the first four PCs are given in Tables 9.17
and 9.18.


  l_j     proportion of variance    cumulated proportion

  4.59           0.459                   0.460
  2.12           0.212                   0.670
  1.32           0.132                   0.800
  1.20           0.120                   0.920
  0.47           0.047                   0.970
  0.20           0.020                   0.990
  0.05           0.005                   0.990
  0.04           0.004                   0.999
  0.02           0.002                   1.000
  0.00           0.000                   1.000


Table 9.16. Eigenvalues of correlation matrix for the time budget data.


Figure 9.18. Representation of the individuals. MVAnpcatime.xpl
(Time budget data; horizontal axis: first factor - individuals, vertical axis: second factor -
individuals, both in units of 0.01.)


               r_{X_i W_1}    r_{X_i W_2}    r_{X_i W_3}    r_{X_i W_4}
X_1:  prof        0.9772        -0.1210        -0.0846         0.0669
X_2:  tran        0.9798         0.0581        -0.0084         0.4555
X_3:  hous       -0.8999         0.0227         0.3624         0.2142
X_4:  kids       -0.8721         0.1786         0.0837         0.2944
X_5:  shop       -0.5636         0.7606        -0.0046        -0.1210
X_6:  pers       -0.0795         0.8181        -0.3022        -0.0636
X_7:  eati       -0.5883        -0.6694        -0.4263         0.0141
X_8:  slee       -0.6442        -0.5693        -0.1908        -0.3125
X_9:  tele       -0.0994         0.1931        -0.9300         0.1512
X_10: leis       -0.0922         0.1103         0.0302        -0.9574


Table 9.17. Correlation of variables with PCs.


Figure 9.19. Representation of the variables. MVAnpcatime.xpl
(Time budget data; horizontal axis: first factor - expenditures, vertical axis: second factor -
expenditures.)
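
The zero eigenvalue noted in Example 9.10 is a direct consequence of the constant row sums.
The sketch below illustrates this with a tiny, made-up time budget (four individuals, three
activities, 24 hours each); the numbers are hypothetical and are not taken from Table B.14.

```python
# Hypothetical time budgets (hours per day) of four individuals on three
# activities; every row sums to 24, mimicking the constraint in Example 9.10.
data = [
    [8.0, 7.0, 9.0],
    [6.0, 8.0, 10.0],
    [9.0, 6.5, 8.5],
    [7.0, 9.0, 8.0],
]
n, p = len(data), len(data[0])
means = [sum(row[j] for row in data) / n for j in range(p)]

# Unbiased sample covariance matrix S.
cov = [[sum((row[a] - means[a]) * (row[b] - means[b]) for row in data) / (n - 1)
        for b in range(p)] for a in range(p)]

# S * (1, ..., 1)^T: each entry is the covariance of one variable with the
# constant row sum, hence zero.
cov_times_ones = [sum(cov[a]) for a in range(p)]

print([round(v, 12) for v in cov_times_ones])
```

Since the vector of ones is mapped to zero, the covariance matrix has a zero eigenvalue. The
correlation matrix is obtained by rescaling rows and columns with the standard deviations,
so it is singular as well, with the vector of standard deviations in its null space.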


From these tables (and Figures 9.18 and 9.19), it appears that the professional and household
activities are strongly contrasted in the first factor. Indeed on the horizontal axis of
Figure 9.18 it can be seen that all the active men are on the right and all the inactive women
are on the left. Active women and/or single women are in between. The second factor contrasts
meal/sleeping vs. toilet/shopping (note the high correlation between meal and sleeping).
Along the vertical axis of Figure 9.18 we see near the bottom of the graph the people from
Western-European countries, who spend more time on meals and sleeping than people from
the U.S. (who can be found close to the top of the graph). The other categories are in between.


In Figure 9.19 the variables television and other leisure activities hardly play any role (look at
Table 9.17). The variable television appears in $Z_3$ (negatively correlated). Table 9.18 shows
that this factor contrasts people from Eastern countries and Yugoslavia with men living in the
U.S. The variable other leisure activities is the factor $Z_4$. It merely distinguishes between men
and women in Eastern countries and in Yugoslavia. These last two factors are orthogonal to
the preceding axes and of course their contribution to the total variation is less important.


           Z_1        Z_2        Z_3        Z_4
maus     0.0633     0.0245    -0.0668     0.0205
waus     0.0061     0.0791    -0.0236     0.0156
wnus    -0.1448     0.0813    -0.0379    -0.0186
mmus     0.0635     0.0105    -0.0673     0.0262
wmus    -0.0934     0.0816    -0.0285     0.0038
msus     0.0537     0.0676    -0.0487    -0.0279
wsus     0.0166     0.1016    -0.0463    -0.0053
mawe     0.0420    -0.0846    -0.0399    -0.0016
wawe    -0.0111    -0.0534    -0.0097     0.0337
wnwe    -0.1544    -0.0583    -0.0318    -0.0051
mmwe     0.0402    -0.0880    -0.0459     0.0054
wmwe    -0.1118    -0.0710    -0.0210     0.0262
mswe     0.0489    -0.0919    -0.0188    -0.0365
wswe    -0.0393    -0.0591    -0.0194    -0.0534
mayo     0.0772    -0.0086     0.0253    -0.0085
wayo     0.0359     0.0064     0.0577     0.0762
wnyo    -0.1263    -0.0135     0.0584    -0.0189
mmyo     0.0793    -0.0076     0.0173    -0.0039
wmyo    -0.0550    -0.0077     0.0579     0.0416
msyo     0.0763     0.0207     0.0575    -0.0778
wsyo     0.0120     0.0149     0.0532    -0.0366
maes     0.0767    -0.0025     0.0047     0.0115
waes     0.0353     0.0209     0.0488     0.0729
wnes    -0.1399     0.0016     0.0240    -0.0348
mmes     0.0742    -0.0061    -0.0152     0.0283
wmes    -0.0175     0.0073     0.0429     0.0719
mses     0.0903     0.0052     0.0379    -0.0701
fses     0.0020     0.0287     0.0358    -0.0346


Table 9.18. PCs for time budget data.


9.10 Exercises 


EXERCISE 9.1 Prove Theorem 9.1. (Hint: use (4.23).) 


EXERCISE 9.2 Interpret the results of the PCA of the U.S. companies. Use the analysis of 
the bank notes in Section 9.3 as a guide. Compare your results with those in Example 9.9. 


EXERCISE 9.3 Test the hypothesis that the proportion of variance explained by the first two
PCs for the U.S. companies is $\psi = 0.75$.


EXERCISE 9.4 Apply the PCA to the car data (Table B.7). Interpret the first two PCs.
Would it be necessary to look at the third PC?


EXERCISE 9.5 Take the athletic records for 55 countries (Appendix B.18) and apply the 
NPCA. Interpret your results. 


EXERCISE 9.6 Apply a PCA to $\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$, where
$\rho > 0$. Now change the scale of $X_1$, i.e., consider the covariance of $cX_1$ and $X_2$. How
do the PC directions change with the screeplot?


EXERCISE 9.7 Suppose that we have standardized some data using the Mahalanobis trans- 
formation. Would it be reasonable to apply a PCA? 


EXERCISE 9.8 Apply a NPCA to the U.S. CRIME data set (Table B.10). Interpret the 
results. Would it be necessary to look at the third PC? Can you see any difference between 
the four regions? Redo the analysis excluding the variable “area of the state.” 


EXERCISE 9.9 Repeat Exercise 9.8 using the U.S. HEALTH data set (Table B.16).


EXERCISE 9.10 Do a NPCA on the GEOPOL data set (see Table B.15) which compares
41 countries w.r.t. different aspects of their development. Why or why not would a PCA be
reasonable here?


EXERCISE 9.11 Let $U$ be a uniform r.v. on $[0,1]$. Let $a \in \mathbb{R}^3$ be a vector of
constants. Suppose that $X = Ua^{\top} = (X_1, X_2, X_3)$. What do you expect the NPCs of $X$ to be?


EXERCISE 9.12 Let $U_1$ and $U_2$ be two independent uniform random variables on $[0,1]$.
Suppose that $X = (X_1, X_2, X_3, X_4)^{\top}$ where $X_1 = U_1$, $X_2 = U_2$, $X_3 = U_1 + U_2$ and
$X_4 = U_1 - U_2$. Compute the correlation matrix $P$ of $X$. How many PCs are of interest? Show
that $\gamma_1 = \left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}, 1, 0\right)^{\top}$ and
$\gamma_2 = \left(\frac{1}{\sqrt{2}}, \frac{-1}{\sqrt{2}}, 0, 1\right)^{\top}$ are eigenvectors of
$P$ corresponding to the non trivial $\lambda$'s. Interpret the first two NPCs obtained.


EXERCISE 9.13 Simulate a sample of size n = 50 for the r.v. X in Exercise 9.12 and 
analyze the results of a NPCA. 


EXERCISE 9.14 Bouroche (1980) reported the data on the state expenses of France from the
period 1872 to 1971 (24 selected years) by noting the percentage of 11 categories of expenses.
Do a NPCA of this data set. Do the three main periods (before WWI, between WWI and
WWII, and after WWII) indicate a change in behavior w.r.t. state expenses?


10 Factor Analysis 


A frequently applied paradigm in analyzing data from multivariate observations is to model 
the relevant information (represented in a multivariate variable X) as coming from a limited 
number of latent factors. In a survey on household consumption, for example, the consump- 
tion levels, X, of p different goods during one month could be observed. The variations and 
covariations of the p components of X throughout the survey might in fact be explained by 
two or three main social behavior factors of the household. For instance, a basic desire of 
comfort or the willingness to achieve a certain social level or other social latent concepts 
might explain most of the consumption behavior. These unobserved factors are much more 
interesting to the social scientist than the observed quantitative measures (X) themselves, 
because they give a better understanding of the behavior of households. As shown in the
examples below, the same kind of factor analysis is of interest in many fields such as psychology,
marketing, economics, political science, etc.


How can we provide a statistical model addressing these issues and how can we interpret 
the obtained model? This is the aim of factor analysis. As in Chapter 8 and Chapter 9, the 
driving statistical theme of this chapter is to reduce the dimension of the observed data. The 
perspective used, however, is different: we assume that there is a model (it will be called the 
“Factor Model”) stating that most of the covariances between the p elements of X can be 
explained by a limited number of latent factors. Section 10.1 defines the basic concepts and 
notations of the orthogonal factor model, stressing the non-uniqueness of the solutions. We 
show how to take advantage of this non-uniqueness to derive techniques which lead to easier 
interpretations. This will involve (geometric) rotations of the factors. Section 10.2 presents 
an empirical approach to factor analysis. Various estimation procedures are proposed and 
an optimal rotation procedure is defined. Many examples are used to illustrate the method. 


10.1 The Orthogonal Factor Model 


The aim of factor analysis is to explain the outcome of $p$ variables in the data matrix
$\mathcal{X}$ using fewer variables, the so-called factors. Ideally all the information in
$\mathcal{X}$ can be reproduced by a smaller number of factors. These factors are interpreted as
latent (unobserved) common characteristics of the observed $x \in \mathbb{R}^p$. The case just
described occurs when every observed $x = (x_1, \ldots, x_p)^{\top}$ can be written as

$$x_j = \sum_{\ell=1}^{k} q_{j\ell} f_{\ell} + \mu_j, \quad j = 1, \ldots, p. \qquad (10.1)$$

Here $f_{\ell}$, for $\ell = 1, \ldots, k$ denotes the factors. The number of factors, $k$, should
always be much smaller than $p$. For instance, in psychology $x$ may represent $p$ results of a test
measuring intelligence scores. One common latent factor explaining $x \in \mathbb{R}^p$ could be the
overall level of “intelligence”. In marketing studies, $x$ may consist of $p$ answers to a survey on
the levels of satisfaction of the customers. These $p$ measures could be explained by common latent
factors like the attraction level of the product or the image of the brand, and so on. Indeed
it is possible to create a representation of the observations that is similar to the one in (10.1)
by means of principal components, but only if the last $p - k$ eigenvalues corresponding to
the covariance matrix are equal to zero. Consider a $p$-dimensional random vector $X$ with
mean $\mu$ and covariance matrix $\mathrm{Var}(X) = \Sigma$. A model similar to (10.1) can be
written for $X$ in matrix notation, namely


$$X = QF + \mu, \qquad (10.2)$$


where $F$ is the $k$-dimensional vector of the $k$ factors. When using the factor model (10.2) it
is often assumed that the factors $F$ are centered, uncorrelated and standardized: $E(F) = 0$
and $\mathrm{Var}(F) = I_k$. We will now show that if the last $p - k$ eigenvalues of $\Sigma$ are
equal to zero, we can easily express $X$ by the factor model (10.2).


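The spectral decomposition of $\Sigma$ is given by $\Gamma \Lambda \Gamma^{\top}$. Suppose that only
the first $k$ eigenvalues are positive, i.e., $\lambda_{k+1} = \ldots = \lambda_p = 0$. Then the
(singular) covariance matrix can be written as

$$\Sigma = \sum_{\ell=1}^{k} \lambda_{\ell} \gamma_{\ell} \gamma_{\ell}^{\top}
  = (\Gamma_1 \; \Gamma_2)
    \begin{pmatrix} \Lambda_1 & 0 \\ 0 & 0 \end{pmatrix}
    \begin{pmatrix} \Gamma_1^{\top} \\ \Gamma_2^{\top} \end{pmatrix}.$$


In order to show the connection to the factor model (10.2), recall that the PCs are given by
$Y = \Gamma^{\top}(X - \mu)$. Rearranging we have $X - \mu = \Gamma Y = \Gamma_1 Y_1 + \Gamma_2 Y_2$,
where the components of $Y$ are partitioned according to the partition of $\Gamma$ above, namely


$$Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}
    = \begin{pmatrix} \Gamma_1^{\top} \\ \Gamma_2^{\top} \end{pmatrix}(X - \mu),
  \quad \text{where} \quad
  \begin{pmatrix} \Gamma_1^{\top} \\ \Gamma_2^{\top} \end{pmatrix}(X - \mu)
    \sim \left(0, \begin{pmatrix} \Lambda_1 & 0 \\ 0 & 0 \end{pmatrix}\right).$$


In other words, $Y_2$ has a singular distribution with mean and covariance matrix equal to
zero. Therefore, $X - \mu = \Gamma_1 Y_1 + \Gamma_2 Y_2$ implies that $X - \mu$ is equivalent to
$\Gamma_1 Y_1$, which can be written as


$$X = \Gamma_1 \Lambda_1^{1/2} \Lambda_1^{-1/2} Y_1 + \mu.$$

Defining $Q = \Gamma_1 \Lambda_1^{1/2}$ and $F = \Lambda_1^{-1/2} Y_1$, we obtain the factor model (10.2).


Note that the covariance matrix of model (10.2) can be written as


$$\Sigma = E(X - \mu)(X - \mu)^{\top} = Q E(FF^{\top}) Q^{\top} = QQ^{\top}
  = \sum_{j=1}^{k} \lambda_j \gamma_j \gamma_j^{\top}. \qquad (10.3)$$


We have just shown how the variable $X$ can be completely determined by a weighted sum
of $k$ (where $k < p$) uncorrelated factors. The situation used in the derivation, however, is
too idealistic. In practice the covariance matrix is rarely singular.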
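
As a small worked illustration of this construction, consider a hypothetical singular
covariance matrix with $p = 2$ and $k = 1$ (the numbers are chosen for convenience and do not
come from any data set in the book):

```latex
\Sigma = \begin{pmatrix} 2 & 2 \\ 2 & 2 \end{pmatrix},
\qquad \lambda_1 = 4, \quad \lambda_2 = 0,
\qquad \gamma_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}.

% Loadings and factor as defined above:
Q = \Gamma_1 \Lambda_1^{1/2}
  = 2 \cdot \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}
  = \begin{pmatrix} \sqrt{2} \\ \sqrt{2} \end{pmatrix},
\qquad
F = \Lambda_1^{-1/2} Y_1 = \frac{1}{2} \gamma_1^{\top} (X - \mu).

% Check of (10.3) and of the standardization of F:
Q Q^{\top} = \begin{pmatrix} 2 & 2 \\ 2 & 2 \end{pmatrix} = \Sigma,
\qquad
\mathrm{Var}(F) = \frac{1}{4} \gamma_1^{\top} \Sigma \gamma_1 = \frac{1}{4} \cdot 4 = 1.
```

Here a single factor reproduces $\Sigma$ exactly because the second eigenvalue is zero; with a
positive second eigenvalue the one-factor representation would only be approximate.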


It is common practice in factor analysis to split the influences of the factors into common
and specific ones. There are, for example, highly informative factors that are common to
all of the components of $X$ and factors that are specific to certain components. The factor
analysis model used in practice is a generalization of (10.2):


$$X = QF + U + \mu, \qquad (10.4)$$


where $Q$ is a $(p \times k)$ matrix of the (non-random) loadings of the common factors
$F$ $(k \times 1)$ and $U$ is a $(p \times 1)$ matrix of the (random) specific factors. It is
assumed that the factor variables $F$ are uncorrelated random vectors and that the specific factors
are uncorrelated and have zero covariance with the common factors. More precisely, it is assumed that:


$$\begin{aligned}
E F &= 0, \\
\mathrm{Var}(F) &= I_k, \\
E U &= 0, \\
\mathrm{Cov}(U_i, U_j) &= 0, \quad i \neq j, \\
\mathrm{Cov}(F, U) &= 0.
\end{aligned} \qquad (10.5)$$


Define
$$\mathrm{Var}(U) = \Psi = \mathrm{diag}(\psi_{11}, \ldots, \psi_{pp}).$$


The generalized factor model (10.4) together with the assumptions given in (10.5) constitute
the orthogonal factor model.


Orthogonal Factor Model

$$\underset{(p \times 1)}{X} = \underset{(p \times k)}{Q}\ \underset{(k \times 1)}{F} + \underset{(p \times 1)}{U} + \underset{(p \times 1)}{\mu}$$

$\mu_j$ = mean of variable $j$
$U_j$ = $j$-th specific factor
$F_\ell$ = $\ell$-th common factor
$q_{j\ell}$ = loading of the $j$-th variable on the $\ell$-th factor

The random vectors $F$ and $U$ are unobservable and uncorrelated.


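Note that (10.4) implies for the components of $X = (X_1, \dots, X_p)^{\top}$ that

$$X_j = \sum_{\ell=1}^{k} q_{j\ell} F_\ell + U_j + \mu_j, \quad j = 1, \dots, p. \tag{10.6}$$

Using (10.5) we obtain $\sigma_{X_jX_j} = \operatorname{Var}(X_j) = \sum_{\ell=1}^{k} q_{j\ell}^2 + \psi_{jj}$. The quantity $h_j^2 = \sum_{\ell=1}^{k} q_{j\ell}^2$ is called the communality and $\psi_{jj}$ the specific variance. Thus the covariance of $X$ can be rewritten as

$$\begin{aligned} \Sigma &= E(X - \mu)(X - \mu)^{\top} = E(QF + U)(QF + U)^{\top} \\ &= Q\,E(FF^{\top})\,Q^{\top} + E(UU^{\top}) = Q \operatorname{Var}(F)\, Q^{\top} + \operatorname{Var}(U) \\ &= QQ^{\top} + \Psi. \end{aligned} \tag{10.7}$$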


In a sense, the factor model explains the variations of $X$ for the most part by a small number of latent factors $F$ common to its $p$ components, which entirely explain the correlation structure between the components, plus some “noise” $U$ which allows specific variations of each component to enter. The specific factors adjust to capture the individual variance of each component. Factor analysis relies on the assumptions presented above. If the assumptions are not met, the analysis could be spurious. Although principal components analysis and factor analysis might be related (this was hinted at in the derivation of the factor model), they are quite different in nature. PCs are linear transformations of $X$ arranged in decreasing order of variance and used to reduce the dimension of the data set, whereas in factor analysis, we try to model the variations of $X$ using a linear transformation of a fixed, limited number of latent factors. The objective of factor analysis is to find the loadings $Q$ and the specific variance $\Psi$. Estimates of $Q$ and $\Psi$ are deduced from the covariance structure (10.7).


Interpretation of the Factors 


Assume that a factor model with $k$ factors was found to be reasonable, i.e., most of the (co)variations of the $p$ measures in $X$ were explained by the $k$ fixed latent factors. The next natural step is to try to understand what these factors represent. To interpret $F_\ell$, it makes sense to compute its correlations with the original variables $X_j$ first. This is done for $\ell = 1, \dots, k$ and for $j = 1, \dots, p$ to obtain the matrix $P_{XF}$. The sequence of calculations used here is in fact the same as that used to interpret the PCs in the principal components analysis.

The following covariance between $X$ and $F$ is obtained via (10.5),

$$\Sigma_{XF} = E\{(QF + U)F^{\top}\} = Q.$$

The correlation is

$$P_{XF} = D^{-1/2} Q, \tag{10.8}$$

where $D = \operatorname{diag}(\sigma_{X_1X_1}, \dots, \sigma_{X_pX_p})$. Using (10.8) it is possible to construct a figure analogous to Figure 9.6 and thus to consider which of the original variables $X_1, \dots, X_p$ play a role in the unobserved common factors $F_1, \dots, F_k$.


Returning to the psychology example where $X$ are the observed scores to $p$ different intelligence tests (the WAIS data set in Table B.12 provides an example), we would expect a model with one factor to produce a factor that is positively correlated with all of the components
in X. For this example the factor represents the overall level of intelligence of an individual. 
A model with two factors could produce a refinement in explaining the variations of the p 
scores. For example, the first factor could be the same as before (overall level of intelligence), 
whereas the second factor could be positively correlated with some of the tests, $X_j$, that are related to the individual’s ability to think abstractly and negatively correlated with other tests, $X_\ell$, that are related to the individual’s practical ability. The second factor would
then concern a particular dimension of the intelligence stressing the distinctions between the 
“theoretical” and “practical” abilities of the individual. If the model is true, most of the 
information coming from the p scores can be summarized by these two latent factors. Other 
practical examples are given below. 


Invariance of Scale 


What happens if we change the scale of $X$ to $Y = CX$ with $C = \operatorname{diag}(c_1, \dots, c_p)$? If the $k$-factor model (10.6) is true for $X$ with $Q = Q_X$, $\Psi = \Psi_X$, then, since

$$\operatorname{Var}(Y) = C \Sigma C^{\top} = C Q_X Q_X^{\top} C^{\top} + C \Psi_X C^{\top},$$

the same $k$-factor model is also true for $Y$ with $Q_Y = C Q_X$ and $\Psi_Y = C \Psi_X C^{\top}$. In many applications, the search for the loadings $Q$ and for the specific variance $\Psi$ will be done by the decomposition of the correlation matrix of $X$ rather than the covariance matrix $\Sigma$. This corresponds to a factor analysis of a linear transformation of $X$ (i.e., $Y = D^{-1/2}(X - \mu)$). The goal is to try to find the loadings $Q_Y$ and the specific variance $\Psi_Y$ such that

$$P = Q_Y Q_Y^{\top} + \Psi_Y. \tag{10.9}$$

In this case the interpretation of the factors $F$ immediately follows from (10.8) given the following correlation matrix:

$$P_{XF} = P_{YF} = Q_Y. \tag{10.10}$$

Because of the scale invariance of the factors, the loadings and the specific variance of the model, where $X$ is expressed in its original units of measure, are given by

$$Q_X = D^{1/2} Q_Y, \qquad \Psi_X = D^{1/2} \Psi_Y D^{1/2}.$$

It should be noted that although the factor analysis model (10.4) enjoys the scale invariance property, the actual estimated factors could be scale dependent. We will come back to this point later when we discuss the method of principal factors.

Non-Uniqueness of Factor Loadings


The factor loadings are not unique! Suppose that $G$ is an orthogonal matrix. Then $X$ in (10.4) can also be written as

$$X = (QG)(G^{\top} F) + U + \mu.$$

This implies that, if a $k$-factor model of $X$ with factors $F$ and loadings $Q$ is true, then the $k$-factor model with factors $G^{\top} F$ and loadings $QG$ is also true. In practice, we will take advantage of this non-uniqueness. Indeed, referring back to Section 2.6 we can conclude that premultiplying a vector $F$ by an orthogonal matrix corresponds to a rotation of the system of axes, the direction of the first new axis being given by the first row of the orthogonal matrix. It will be shown that choosing an appropriate rotation will result in a matrix of loadings $QG$ that will be easier to interpret. We have seen that the loadings provide the correlations between the factors and the original variables; therefore, it makes sense to search for rotations that give factors that are maximally correlated with various groups of variables.
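The non-uniqueness can be verified numerically: rotating the loadings leaves $QQ^{\top}$, and hence the implied covariance structure, unchanged. The sketch below uses plain Python; the loading matrix `Q` and the angle `theta` are hypothetical values chosen only for illustration.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Hypothetical (3 x 2) loading matrix and rotation angle, for illustration only.
Q = [[0.9, 0.1], [0.8, -0.3], [0.2, 0.7]]
theta = 0.6
G = [[math.cos(theta), math.sin(theta)],
     [-math.sin(theta), math.cos(theta)]]  # orthogonal matrix

QG = matmul(Q, G)
QQt = matmul(Q, transpose(Q))
QGQGt = matmul(QG, transpose(QG))

# Largest entrywise difference between Q Q^T and (QG)(QG)^T.
max_diff = max(abs(a - b) for ra, rb in zip(QQt, QGQGt) for a, b in zip(ra, rb))
print(max_diff)
```

The loadings `QG` differ from `Q`, yet both reproduce the same matrix $QQ^{\top}$ up to floating-point rounding.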


From a numerical point of view, the non-uniqueness is a drawback. We have to find loadings $Q$ and specific variances $\Psi$ satisfying the decomposition $\Sigma = QQ^{\top} + \Psi$, but no straightforward numerical algorithm can solve this problem due to the multiplicity of the solutions. An acceptable technique is to impose some chosen constraints in order to get, in the best case, a unique solution to the decomposition. Then, as suggested above, once we have a solution we will take advantage of the rotations in order to obtain a solution that is easier to interpret.


An obvious question is: what kind of constraints should we impose in order to eliminate the non-uniqueness problem? Usually, we impose additional constraints where

$$Q^{\top} \Psi^{-1} Q \ \text{ is diagonal} \tag{10.11}$$

or

$$Q^{\top} D^{-1} Q \ \text{ is diagonal.} \tag{10.12}$$


How many parameters does the model (10.7) have without constraints?

$Q$ $(p \times k)$ has $p \cdot k$ parameters, and
$\Psi$ $(p \times p)$ has $p$ parameters.

Hence we have to determine $pk + p$ parameters! Conditions (10.11) and (10.12), respectively, introduce $\frac{1}{2}\{k(k - 1)\}$ constraints, since we require the matrices to be diagonal. Therefore, the degrees of freedom of a model with $k$ factors is:

$$\begin{aligned} d &= (\#\text{ parameters for } \Sigma \text{ unconstrained}) - (\#\text{ parameters for } \Sigma \text{ constrained}) \\ &= \tfrac{1}{2} p(p + 1) - \left(pk + p - \tfrac{1}{2} k(k - 1)\right) \\ &= \tfrac{1}{2}(p - k)^2 - \tfrac{1}{2}(p + k). \end{aligned}$$

If $d < 0$, then the model is undetermined: there are infinitely many solutions to (10.7). This means that the number of parameters of the factorial model is larger than the number of parameters of the original model, or that the number of factors $k$ is “too large” relative to $p$. In some cases $d = 0$: there is a unique solution to the problem (except for rotation). In practice we usually have that $d > 0$: there are more equations than parameters, thus an exact solution does not exist. In this case approximate solutions are used. An approximation of $\Sigma$, for example, is $QQ^{\top} + \Psi$. The last case is the most interesting since the factorial model has fewer parameters than the original one. Estimation methods are introduced in the next section.


Evaluating the degrees of freedom, $d$, is particularly important, because it already gives an idea of the upper bound on the number of factors we can hope to identify in a factor model. For instance, if $p = 4$, we could not identify a factor model with 2 factors (this results in $d = -1$ which has infinitely many solutions). With $p = 4$, only a one factor model gives an approximate solution ($d = 2$). When $p = 6$, models with 1 and 2 factors provide approximate solutions and a model with 3 factors results in a unique solution (up to the rotations) since $d = 0$. A model with 4 or more factors would not be allowed, but of course, the aim of factor analysis is to find suitable models with a small number of factors, i.e., smaller than $p$. The next two examples give more insights into the notion of degrees of freedom.
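The counting argument above is easy to tabulate. The following sketch evaluates $d = \frac{1}{2}(p-k)^2 - \frac{1}{2}(p+k)$ for the cases mentioned in the text; the function names are illustrative choices, not part of the original.

```python
def degrees_of_freedom(p, k):
    """d = (p - k)^2 / 2 - (p + k) / 2 for p variables and k factors.

    (p - k)^2 and (p + k) always have the same parity, so the integer
    division below is exact.
    """
    return ((p - k) ** 2 - (p + k)) // 2

def max_identifiable_factors(p):
    """Largest k < p with d >= 0 (returns 0 if no such k exists)."""
    return max((k for k in range(1, p) if degrees_of_freedom(p, k) >= 0), default=0)

for p, k in [(4, 1), (4, 2), (6, 1), (6, 2), (6, 3), (6, 4)]:
    print(f"p={p}, k={k}: d={degrees_of_freedom(p, k)}")
```

For $p = 4$ this gives $d = 2$ and $d = -1$ for one and two factors, and for $p = 6$ the values $9, 4, 0, -3$ for $k = 1, \dots, 4$, matching the discussion above.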


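EXAMPLE 10.1 Let $p = 3$ and $k = 1$, then $d = 0$ and

$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} \end{pmatrix} = \begin{pmatrix} q_1^2 + \psi_{11} & q_1 q_2 & q_1 q_3 \\ q_1 q_2 & q_2^2 + \psi_{22} & q_2 q_3 \\ q_1 q_3 & q_2 q_3 & q_3^2 + \psi_{33} \end{pmatrix}$$

with

$$Q = \begin{pmatrix} q_1 \\ q_2 \\ q_3 \end{pmatrix} \quad \text{and} \quad \Psi = \begin{pmatrix} \psi_{11} & 0 & 0 \\ 0 & \psi_{22} & 0 \\ 0 & 0 & \psi_{33} \end{pmatrix}.$$

Note that here the constraint (10.11) is automatically verified since $k = 1$. We have

$$q_1^2 = \frac{\sigma_{12}\sigma_{13}}{\sigma_{23}}; \quad q_2^2 = \frac{\sigma_{12}\sigma_{23}}{\sigma_{13}}; \quad q_3^2 = \frac{\sigma_{13}\sigma_{23}}{\sigma_{12}}$$

and

$$\psi_{11} = \sigma_{11} - q_1^2; \quad \psi_{22} = \sigma_{22} - q_2^2; \quad \psi_{33} = \sigma_{33} - q_3^2.$$

In this particular case ($k = 1$), the only rotation is defined by $G = -1$, so the other solution for the loadings is provided by $-Q$.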


EXAMPLE 10.2 Suppose now $p = 2$ and $k = 1$, then $d < 0$ and

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} = \begin{pmatrix} q_1^2 + \psi_{11} & q_1 q_2 \\ q_1 q_2 & q_2^2 + \psi_{22} \end{pmatrix}.$$

We have infinitely many solutions: for any $\alpha$ ($\rho < \alpha < 1$), a solution is provided by

$$q_1 = \alpha; \quad q_2 = \rho/\alpha; \quad \psi_{11} = 1 - \alpha^2; \quad \psi_{22} = 1 - (\rho/\alpha)^2.$$

The solution in Example 10.1 may be unique (up to a rotation), but it is not proper in the sense that it cannot be interpreted statistically. Exercise 10.5 gives an example where the specific variance $\psi_{11}$ is negative.

Even in the case of a unique solution ($d = 0$), the solution may be inconsistent with statistical interpretations.


Summary

↪ The factor analysis model aims to describe how the original $p$ variables in a data set depend on a small number of latent factors $k < p$, i.e., it assumes that $X = QF + U + \mu$. The ($k$-dimensional) random vector $F$ contains the common factors, the ($p$-dimensional) $U$ contains the specific factors and $Q$ $(p \times k)$ contains the factor loadings.

↪ It is assumed that $F$ and $U$ are uncorrelated and have zero means, i.e., $F \sim (0, I_k)$, $U \sim (0, \Psi)$ where $\Psi$ is a diagonal matrix and $\operatorname{Cov}(F, U) = 0$. This leads to the covariance structure $\Sigma = QQ^{\top} + \Psi$.

↪ The interpretation of the factor $F$ is obtained through the correlation $P_{XF} = D^{-1/2} Q$.

↪ A normalized analysis is obtained by the model $P = QQ^{\top} + \Psi$. The interpretation of the factors is given directly by the loadings $Q$: $P_{XF} = Q$.

↪ The factor analysis model is scale invariant. The loadings are not unique (only up to multiplication by an orthogonal matrix).

↪ Whether a model has a unique solution or not is determined by the degrees of freedom $d = \frac{1}{2}(p - k)^2 - \frac{1}{2}(p + k)$.


10.2 Estimation of the Factor Model 


In practice, we have to find estimates $\hat Q$ of the loadings $Q$ and estimates $\hat\Psi$ of the specific variances $\Psi$ such that analogously to (10.7)

$$S = \hat Q \hat Q^{\top} + \hat\Psi,$$

where $S$ denotes the empirical covariance of $\mathcal{X}$. Given an estimate $\hat Q$ of $Q$, it is natural to set

$$\hat\psi_{jj} = s_{X_jX_j} - \sum_{\ell=1}^{k} \hat q_{j\ell}^2.$$

We have that $\hat h_j^2 = \sum_{\ell=1}^{k} \hat q_{j\ell}^2$ is an estimate for the communality $h_j^2$.

In the ideal case $d = 0$, there is an exact solution. However, $d$ is usually greater than zero, therefore we have to find $\hat Q$ and $\hat\Psi$ such that $S$ is approximated by $\hat Q \hat Q^{\top} + \hat\Psi$. As mentioned above, it is often easier to compute the loadings and the specific variances of the standardized model.


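Define $\mathcal{Y} = \mathcal{H}\mathcal{X} D^{-1/2}$, the standardization of the data matrix $\mathcal{X}$, where, as usual, $D = \operatorname{diag}(s_{X_1X_1}, \dots, s_{X_pX_p})$ and the centering matrix $\mathcal{H} = I_n - n^{-1} 1_n 1_n^{\top}$ (recall from Chapter 2 that $S = \frac{1}{n} \mathcal{X}^{\top} \mathcal{H} \mathcal{X}$). The estimated factor loading matrix $\hat Q_Y$ and the estimated specific variance $\hat\Psi_Y$ of $\mathcal{Y}$ are

$$\hat Q_Y = D^{-1/2} \hat Q_X \quad \text{and} \quad \hat\Psi_Y = D^{-1} \hat\Psi_X.$$

For the correlation matrix $R$ of $\mathcal{X}$, we have that

$$R = \hat Q_Y \hat Q_Y^{\top} + \hat\Psi_Y.$$

The interpretations of the factors are formulated from the analysis of the loadings $\hat Q_Y$.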


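EXAMPLE 10.3 Let us calculate the matrices just defined for the car data given in Table B.7. This data set consists of the averaged marks (from 1 = low to 6 = high) for 24 car types. Considering the three variables price, security and easy handling, we get the following correlation matrix:

$$R = \begin{pmatrix} 1 & 0.975 & 0.613 \\ 0.975 & 1 & 0.620 \\ 0.613 & 0.620 & 1 \end{pmatrix}.$$

We will first look for one factor, i.e., $k = 1$. Note that (# parameters of $\Sigma$ unconstrained $-$ # parameters of $\Sigma$ constrained) is equal to $\frac{1}{2}(p - k)^2 - \frac{1}{2}(p + k) = \frac{1}{2}(3 - 1)^2 - \frac{1}{2}(3 + 1) = 0$. This implies that there is an exact solution! The equation

$$\begin{pmatrix} 1 & r_{X_1X_2} & r_{X_1X_3} \\ r_{X_1X_2} & 1 & r_{X_2X_3} \\ r_{X_1X_3} & r_{X_2X_3} & 1 \end{pmatrix} = R = \begin{pmatrix} \hat q_1^2 + \hat\psi_{11} & \hat q_1 \hat q_2 & \hat q_1 \hat q_3 \\ \hat q_1 \hat q_2 & \hat q_2^2 + \hat\psi_{22} & \hat q_2 \hat q_3 \\ \hat q_1 \hat q_3 & \hat q_2 \hat q_3 & \hat q_3^2 + \hat\psi_{33} \end{pmatrix}$$

yields the communalities $\hat h_i^2 = \hat q_i^2$, where

$$\hat q_1^2 = \frac{r_{X_1X_2}\, r_{X_1X_3}}{r_{X_2X_3}}, \quad \hat q_2^2 = \frac{r_{X_1X_2}\, r_{X_2X_3}}{r_{X_1X_3}} \quad \text{and} \quad \hat q_3^2 = \frac{r_{X_1X_3}\, r_{X_2X_3}}{r_{X_1X_2}}.$$

Combining this with the specific variances $\hat\psi_{11} = 1 - \hat q_1^2$, $\hat\psi_{22} = 1 - \hat q_2^2$ and $\hat\psi_{33} = 1 - \hat q_3^2$, we obtain the following solution

$$\begin{array}{lll} \hat q_1 = 0.982 & \hat q_2 = 0.993 & \hat q_3 = 0.624 \\ \hat\psi_{11} = 0.035 & \hat\psi_{22} = 0.014 & \hat\psi_{33} = 0.610. \end{array}$$

Since the first two communalities ($\hat h_i^2 = \hat q_i^2$) are close to one, we can conclude that the first two variables, namely price and security, are explained by the single factor quite well. This factor can be interpreted as a “price+security” factor.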
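The closed-form solution for $p = 3$, $k = 1$ can be reproduced directly from the correlation matrix of Example 10.3. The sketch below is a plain-Python illustration; the function name is hypothetical, and it assumes that the product of the three correlations is positive so that the square roots exist.

```python
import math

def one_factor_exact(R):
    """Exact one-factor solution for a 3 x 3 correlation matrix (d = 0).

    Returns the positive-sign loadings q_j and the specific variances psi_jj;
    the other solution is obtained by flipping the sign of all loadings.
    """
    r12, r13, r23 = R[0][1], R[0][2], R[1][2]
    q_squared = [r12 * r13 / r23, r12 * r23 / r13, r13 * r23 / r12]
    loadings = [math.sqrt(v) for v in q_squared]
    psi = [R[j][j] - q_squared[j] for j in range(3)]
    return loadings, psi

# Correlation matrix of price, security and easy handling (Example 10.3).
R = [[1.0, 0.975, 0.613],
     [0.975, 1.0, 0.620],
     [0.613, 0.620, 1.0]]

loadings, psi = one_factor_exact(R)
print([round(q, 3) for q in loadings])
print([round(v, 3) for v in psi])
```

Because the correlations are entered with three decimals only, the results agree with the values reported in the example up to rounding in the last digit.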


The Maximum Likelihood Method 


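Recall from Chapter 6 the log-likelihood function $\ell$ for a data matrix $\mathcal{X}$ of observations of $X \sim N_p(\mu, \Sigma)$:

$$\begin{aligned} \ell(\mathcal{X}; \mu, \Sigma) &= -\frac{n}{2} \log|2\pi\Sigma| - \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)\Sigma^{-1}(x_i - \mu)^{\top} \\ &= -\frac{n}{2} \log|2\pi\Sigma| - \frac{n}{2} \operatorname{tr}(\Sigma^{-1} S) - \frac{n}{2} (\bar x - \mu)\Sigma^{-1}(\bar x - \mu)^{\top}. \end{aligned}$$

Replacing $\mu$ by $\hat\mu = \bar x$, this can be rewritten as

$$\ell(\mathcal{X}; \hat\mu, \Sigma) = -\frac{n}{2} \left\{ \log|2\pi\Sigma| + \operatorname{tr}(\Sigma^{-1} S) \right\}.$$

Substituting $\Sigma = QQ^{\top} + \Psi$ this becomes

$$\ell(\mathcal{X}; \hat\mu, Q, \Psi) = -\frac{n}{2} \left[ \log\{|2\pi(QQ^{\top} + \Psi)|\} + \operatorname{tr}\{(QQ^{\top} + \Psi)^{-1} S\} \right]. \tag{10.13}$$

Even in the case of a single factor ($k = 1$), these equations are rather complicated and iterative numerical algorithms have to be used (for more details see Mardia et al. (1979, p. 263ff)). A practical computation scheme is also given in Supplement 9A of Johnson and Wichern (1998).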
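To make (10.13) concrete, the following NumPy sketch evaluates the log-likelihood for given loadings and specific variances. It only evaluates the objective and is not the iterative maximization referred to above; the function name and the numerical values are illustrative assumptions.

```python
import numpy as np

def factor_loglik(Q, psi, S, n):
    """Evaluate (10.13) for loadings Q (p x k) and specific variances psi (length p)."""
    Sigma = Q @ Q.T + np.diag(psi)
    p = S.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        raise ValueError("Q Q^T + Psi must be positive definite")
    # log|2 pi Sigma| = p log(2 pi) + log|Sigma|
    log_det_term = p * np.log(2.0 * np.pi) + logdet
    trace_term = np.trace(np.linalg.solve(Sigma, S))
    return -0.5 * n * (log_det_term + trace_term)

# Hypothetical one-factor structure that reproduces S exactly.
Q0 = np.array([[0.9], [0.8], [0.5]])
psi0 = 1.0 - (Q0 ** 2).sum(axis=1)
S0 = Q0 @ Q0.T + np.diag(psi0)
n0 = 100

value_at_truth = factor_loglik(Q0, psi0, S0, n0)
value_perturbed = factor_loglik(0.9 * Q0, psi0, S0, n0)
print(value_at_truth, value_perturbed)
```

When $QQ^{\top} + \Psi$ equals $S$ the trace term reduces to $p$, and any other positive definite choice gives a smaller value, which is why `value_perturbed` lies below `value_at_truth`.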


Likelihood Ratio Test for the Number of Common Factors 


Using the methodology of Chapter 7, it is easy to test the adequacy of the factor analysis model by comparing the likelihood under the null (factor analysis) and alternative (no constraints on covariance matrix) hypotheses.

Assuming that $\hat Q$ and $\hat\Psi$ are the maximum likelihood estimates corresponding to (10.13), we obtain the following LR test statistic:

$$-2 \log\left( \frac{\text{maximized likelihood under } H_0}{\text{maximized likelihood}} \right) = n \log\left( \frac{|\hat Q \hat Q^{\top} + \hat\Psi|}{|S|} \right), \tag{10.14}$$

which asymptotically has the $\chi^2_{\frac{1}{2}\{(p-k)^2 - p - k\}}$ distribution.

The $\chi^2$ approximation can be improved if we replace $n$ by $n - 1 - (2p + 4k + 5)/6$ in (10.14) (Bartlett, 1954). Using Bartlett’s correction, we reject the factor analysis model at the $\alpha$ level if

$$\{n - 1 - (2p + 4k + 5)/6\} \log\left( \frac{|\hat Q \hat Q^{\top} + \hat\Psi|}{|S|} \right) > \chi^2_{1-\alpha;\{(p-k)^2 - p - k\}/2}, \tag{10.15}$$

and if the number of observations $n$ is large and the number of common factors $k$ is such that the $\chi^2$ statistic has a positive number of degrees of freedom.
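The Bartlett-corrected statistic in (10.15) is simple to compute once the two determinants are available. The sketch below is plain Python with a hypothetical function name; the determinants and sample size in the usage example are made-up numbers, and the critical value has to be taken from a $\chi^2$ table or a statistics library.

```python
import math

def bartlett_lr_statistic(n, p, k, det_fitted, det_S):
    """Left-hand side of (10.15) and the chi-square degrees of freedom.

    det_fitted is |Q Q^T + Psi| at the ML estimates, det_S is |S|.
    """
    df = ((p - k) ** 2 - p - k) // 2
    if df <= 0:
        raise ValueError("the test needs a positive number of degrees of freedom")
    correction = n - 1 - (2 * p + 4 * k + 5) / 6
    statistic = correction * math.log(det_fitted / det_S)
    return statistic, df

# Hypothetical inputs: n = 100 observations, p = 6 variables, k = 2 factors.
statistic, df = bartlett_lr_statistic(100, 6, 2, det_fitted=1.1, det_S=1.0)
print(round(statistic, 3), df)
```

With these illustrative inputs the statistic is about 9.04 on 4 degrees of freedom, which is below the usual 5% critical value of roughly 9.49, so the two-factor model would not be rejected.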


The Method of Principal Factors 


The method of principal factors concentrates on the decomposition of the correlation matrix $R$ or the covariance matrix $S$. For simplicity, only the method for the correlation matrix $R$ will be discussed. As pointed out in Chapter 9, the spectral decompositions of $R$ and $S$ yield different results and therefore, the method of principal factors may result in different estimators. The method can be motivated as follows: Suppose we know the exact $\Psi$, then the constraint (10.12) implies that the columns of $Q$ are orthogonal since $D = I$ and it implies that they are eigenvectors of $QQ^{\top} = R - \Psi$. Furthermore, assume that the first $k$ eigenvalues are positive. In this case we could calculate $Q$ by means of a spectral decomposition of $QQ^{\top}$ and $k$ would be the number of factors.

The principal factors algorithm is based on good preliminary estimators $\tilde h_j^2$ of the communalities $h_j^2$, for $j = 1, \dots, p$. There are two traditional proposals:

• $\tilde h_j^2$, defined as the square of the multiple correlation coefficient of $X_j$ with $(X_\ell)$, for $\ell \neq j$, i.e., $\rho^2(V, W\hat\beta)$ with $V = X_j$, $W = (X_\ell)_{\ell \neq j}$ and where $\hat\beta$ is the least squares regression parameter of a regression of $V$ on $W$.

• $\tilde h_j^2 = \max_{\ell \neq j} |r_{X_jX_\ell}|$, where $R = (r_{X_jX_\ell})$ is the correlation matrix of $X$.

Given $\tilde\psi_{jj} = 1 - \tilde h_j^2$ we can construct the reduced correlation matrix, $R - \tilde\Psi$. The Spectral Decomposition Theorem says that

$$R - \tilde\Psi = \sum_{\ell=1}^{p} \lambda_\ell \gamma_\ell \gamma_\ell^{\top},$$

with eigenvalues $\lambda_1 \geq \dots \geq \lambda_p$. Assume that the first $k$ eigenvalues $\lambda_1, \dots, \lambda_k$ are positive and large compared to the others. Then we can set

$$\hat q_\ell = \sqrt{\lambda_\ell}\, \gamma_\ell, \quad \ell = 1, \dots, k$$

or

$$\hat Q = \Gamma_1 \Lambda_1^{1/2}$$

with

$$\Gamma_1 = (\gamma_1, \dots, \gamma_k) \quad \text{and} \quad \Lambda_1 = \operatorname{diag}(\lambda_1, \dots, \lambda_k).$$

In the next step set

$$\hat\psi_{jj} = 1 - \sum_{\ell=1}^{k} \hat q_{j\ell}^2, \quad j = 1, \dots, p.$$

Note that the procedure can be iterated: from $\hat\psi_{jj}$ we can compute a new reduced correlation matrix $R - \hat\Psi$ following the same procedure. The iteration usually stops when the $\hat\psi_{jj}$ have converged to a stable value.
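The iteration just described can be sketched in a few lines of NumPy. This is one possible reading of the algorithm and not the authors' implementation: the function name, the stopping rule and the clipping of negative eigenvalues are assumptions, and the default start uses the second proposal ($\tilde h_j^2 = \max_{\ell \neq j} |r_{X_jX_\ell}|$).

```python
import numpy as np

def principal_factors(R, k, psi_init=None, n_iter=100, tol=1e-8):
    """Iterated principal factor method on a correlation matrix R."""
    R = np.asarray(R, dtype=float)
    if psi_init is None:
        # Preliminary communality: largest absolute off-diagonal correlation per row.
        off_diag = np.abs(R - np.diag(np.diag(R)))
        psi = 1.0 - off_diag.max(axis=1)
    else:
        psi = np.asarray(psi_init, dtype=float)
    for _ in range(n_iter):
        # Spectral decomposition of the reduced correlation matrix R - Psi.
        lam, gamma = np.linalg.eigh(R - np.diag(psi))
        idx = np.argsort(lam)[::-1][:k]
        # Clip guards against slightly negative eigenvalues before the root.
        Q = gamma[:, idx] * np.sqrt(np.clip(lam[idx], 0.0, None))
        psi_new = 1.0 - (Q ** 2).sum(axis=1)
        converged = np.max(np.abs(psi_new - psi)) < tol
        psi = psi_new
        if converged:
            break
    return Q, psi

# Correlation matrix of Example 10.3 (price, security, easy handling), k = 1.
R3 = np.array([[1.0, 0.975, 0.613],
               [0.975, 1.0, 0.620],
               [0.613, 0.620, 1.0]])
Q_pf, psi_pf = principal_factors(R3, k=1)
print(np.round(Q_pf.ravel(), 3), np.round(psi_pf, 3))
```

If the exact specific variances are supplied as `psi_init`, the reduced matrix has rank one and a single pass returns the exact loadings (up to sign), which is the motivation given at the start of this subsection.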


EXAMPLE 10.4 Consider once again the car data given in Table B.7. From Exercise 9.4 we know that the first PC is mainly influenced by $X_2$ to $X_7$. Moreover, we know that most of the variance is already captured by the first PC. Thus we can conclude that the data are mainly determined by one factor ($k = 1$).

The eigenvalues of $R - \hat\Psi$ for $\hat\Psi = (\max_{j \neq i} |r_{X_iX_j}|)$ are

$$(5.448,\ 0.003,\ -0.246,\ -0.646,\ -0.901,\ -0.911,\ -0.948,\ -0.964)^{\top}.$$

It would suffice to choose only one factor. Nevertheless, we have computed two factors. The result (the factor loadings for two factors) is shown in Figure 10.1.

We can clearly see a cluster of points to the right, which contain the factor loadings for the variables $X_2$ to $X_7$. This shows, as did the PCA, that these variables are highly dependent and are thus more or less equivalent. The factor loadings for $X_1$ (economy) and $X_8$ (easy handling) are separate, but note the different scales on the horizontal and vertical axes! Although there are two or three sets of variables in the plot, the variance is already explained by the first factor, the “price+security” factor.


The Principal Component Method 


The principal factor method involves finding an approximation $\hat\Psi$ of $\Psi$, the matrix of specific variances, and then correcting $R$, the correlation matrix of $X$, by $\hat\Psi$. The principal component method starts with an approximation $\hat Q$ of $Q$, the factor loadings matrix. The sample covariance matrix is diagonalized, $S = \Gamma\Lambda\Gamma^{\top}$. Then the first $k$ eigenvectors are retained to build

$$\hat Q = [\sqrt{\lambda_1}\gamma_1, \dots, \sqrt{\lambda_k}\gamma_k]. \tag{10.16}$$

Figure 10.1. Loadings of the evaluated car qualities, factor analysis with $k = 2$ (plot “car marks data”: first factor on the horizontal axis, second factor on the vertical axis). MVAfactcarm.xpl


The estimated specific variances are provided by the diagonal elements of the matrix $S - \hat Q \hat Q^{\top}$,

$$\hat\Psi = \begin{pmatrix} \hat\psi_{11} & & 0 \\ & \ddots & \\ 0 & & \hat\psi_{pp} \end{pmatrix} \quad \text{with} \quad \hat\psi_{jj} = s_{X_jX_j} - \sum_{\ell=1}^{k} \hat q_{j\ell}^2. \tag{10.17}$$

By definition, the diagonal elements of $S$ are equal to the diagonal elements of $\hat Q \hat Q^{\top} + \hat\Psi$. The off-diagonal elements are not necessarily estimated. How good then is this approximation? Consider the residual matrix

$$S - (\hat Q \hat Q^{\top} + \hat\Psi)$$

resulting from the principal component solution. Analytically we have that

$$\sum_{i,j} (S - \hat Q \hat Q^{\top} - \hat\Psi)_{ij}^2 \leq \lambda_{k+1}^2 + \dots + \lambda_p^2.$$
This implies that a small value of the neglected eigenvalues can result in a small approxima- 
tion error. A heuristic device for selecting the number of factors is to consider the proportion 
of the total sample variance due to the j-th factor. This quantity is in general equal to 


(A) $\lambda_j / \sum_{j=1}^{p} s_{jj}$ for a factor analysis of $S$,

(B) $\lambda_j / p$ for a factor analysis of $R$.


EXAMPLE 10.5 This example uses a consumer-preference study from Johnson and Wichern (1998). Customers were asked to rate several attributes of a new product. The responses were tabulated and the following correlation matrix $R$ was constructed:

Attribute (Variable)             1     2     3     4     5
Taste                      1   1.00  0.02  0.96  0.42  0.01
Good buy for money         2   0.02  1.00  0.13  0.71  0.85
Flavor                     3   0.96  0.13  1.00  0.50  0.11
Suitable for snack         4   0.42  0.71  0.50  1.00  0.79
Provides lots of energy    5   0.01  0.85  0.11  0.79  1.00

The large entries of $R$ show that variables 1 and 3 (0.96) and variables 2 and 5 (0.85) are highly correlated. Variable 4 is more correlated with variables 2 and 5 than with variables 1 and 3. Hence, a model with 2 (or 3) factors seems to be reasonable.

The first two eigenvalues $\lambda_1 = 2.85$ and $\lambda_2 = 1.81$ of $R$ are the only eigenvalues greater than one. Moreover, $k = 2$ common factors account for a cumulative proportion

$$\frac{\lambda_1 + \lambda_2}{p} = \frac{2.85 + 1.81}{5} = 0.93$$

of the total (standardized) sample variance. Using the principal component method, the estimated factor loadings, communalities, and specific variances, are calculated from formulas (10.16) and (10.17), and the results are given in Table 10.2.


                               Estimated factor                     Specific
                                   loadings       Communalities     variances
Variable                       $\hat q_1$  $\hat q_2$   $\hat h_j^2$   $\hat\psi_{jj} = 1 - \hat h_j^2$
1. Taste                         0.56     0.82        0.98            0.02
2. Good buy for money            0.78    -0.53        0.88            0.12
3. Flavor                        0.65     0.75        0.98            0.02
4. Suitable for snack            0.94    -0.11        0.89            0.11
5. Provides lots of energy       0.80    -0.54        0.93            0.07
Eigenvalues                      2.85     1.81
Cumulative proportion of total
(standardized) sample variance   0.571    0.932

Table 10.2. Estimated factor loadings, communalities, and specific variances

Take a look at:

$$\hat Q \hat Q^{\top} + \hat\Psi = \begin{pmatrix} 0.56 & 0.82 \\ 0.78 & -0.53 \\ 0.65 & 0.75 \\ 0.94 & -0.11 \\ 0.80 & -0.54 \end{pmatrix} \begin{pmatrix} 0.56 & 0.78 & 0.65 & 0.94 & 0.80 \\ 0.82 & -0.53 & 0.75 & -0.11 & -0.54 \end{pmatrix} + \begin{pmatrix} 0.02 & 0 & 0 & 0 & 0 \\ 0 & 0.12 & 0 & 0 & 0 \\ 0 & 0 & 0.02 & 0 & 0 \\ 0 & 0 & 0 & 0.11 & 0 \\ 0 & 0 & 0 & 0 & 0.07 \end{pmatrix}$$

$$= \begin{pmatrix} 1.00 & 0.01 & 0.97 & 0.44 & 0.00 \\ 0.01 & 1.00 & 0.11 & 0.79 & 0.91 \\ 0.97 & 0.11 & 1.00 & 0.53 & 0.11 \\ 0.44 & 0.79 & 0.53 & 1.00 & 0.81 \\ 0.00 & 0.91 & 0.11 & 0.81 & 1.00 \end{pmatrix}.$$


This nearly reproduces the correlation matrix $R$. We conclude that the two-factor model
provides a good fit of the data. The communalities (0.98, 0.88, 0.98, 0.89, 0.93) indicate that 
the two factors account for a large percentage of the sample variance of each variable. Due 
to the nonuniqueness of factor loadings, the interpretation might be enhanced by rotation. 
This is the topic of the next subsection. 
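The principal component solution of Example 10.5 can be recomputed from the correlation matrix with a short NumPy sketch following (10.16) and (10.17). The sign normalization of the eigenvectors is an arbitrary convention chosen here so that the columns line up with Table 10.2; it is not part of the method itself.

```python
import numpy as np

# Correlation matrix of the consumer-preference study (Example 10.5).
R = np.array([
    [1.00, 0.02, 0.96, 0.42, 0.01],
    [0.02, 1.00, 0.13, 0.71, 0.85],
    [0.96, 0.13, 1.00, 0.50, 0.11],
    [0.42, 0.71, 0.50, 1.00, 0.79],
    [0.01, 0.85, 0.11, 0.79, 1.00],
])
k = 2

# Spectral decomposition with eigenvalues sorted in decreasing order.
lam, gamma = np.linalg.eigh(R)
order = np.argsort(lam)[::-1]
lam, gamma = lam[order], gamma[:, order]

Q_hat = gamma[:, :k] * np.sqrt(lam[:k])        # loadings, formula (10.16)
Q_hat *= np.sign(Q_hat.sum(axis=0))            # fix the arbitrary eigenvector sign
psi_hat = np.diag(R) - (Q_hat ** 2).sum(axis=1)  # specific variances, (10.17)

# Residual matrix and the bound given by the neglected eigenvalues.
residual = R - (Q_hat @ Q_hat.T + np.diag(psi_hat))
sum_sq_residual = (residual ** 2).sum()
bound = (lam[k:] ** 2).sum()

print(np.round(lam[:k], 2))
print(np.round(Q_hat, 2))
print(np.round(psi_hat, 2))
print(sum_sq_residual <= bound)
```

The diagonal of the residual matrix is zero by construction, so only the off-diagonal entries contribute to the approximation error.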


Rotation 


The constraints (10.11) and (10.12) are given as a matter of mathematical convenience (to create unique solutions) and can therefore complicate the problem of interpretation. The interpretation of the loadings would be very simple if the variables could be split into disjoint sets, each being associated with one factor. A well known analytical algorithm to rotate the loadings is given by the varimax rotation method proposed by Kaiser (1985). In the simplest case of $k = 2$ factors, a rotation matrix $G$ is given by

$$G(\theta) = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix},$$

representing a clockwise rotation of the coordinate axes by the angle $\theta$. The corresponding rotation of loadings is calculated via $\hat Q^* = \hat Q G(\theta)$. The idea of the varimax method is to find the angle $\theta$ that maximizes the sum of the variances of the squared loadings $\hat q_{j\ell}^*$ within each column of $\hat Q^*$. More precisely, defining $\tilde q_{j\ell} = \hat q_{j\ell}^* / \hat h_j^*$, the varimax criterion chooses $\theta$ so that

$$\mathcal{V} = \frac{1}{p} \sum_{\ell=1}^{k} \left[ \sum_{j=1}^{p} \tilde q_{j\ell}^{\,4} - \frac{1}{p} \left\{ \sum_{j=1}^{p} \tilde q_{j\ell}^{\,2} \right\}^2 \right]$$

is maximized.


EXAMPLE 10.6 Let us return to the marketing example of Johnson and Wichern (1998) (Example 10.5). The basic factor loadings given in Table 10.2 of the first factor and a second factor are almost identical making it difficult to interpret the factors. Applying the varimax rotation we obtain the loadings $\tilde q_1 = (0.02, 0.94, 0.13, 0.84, 0.97)^{\top}$ and $\tilde q_2 = (0.99, -0.01, 0.98, 0.43, -0.02)^{\top}$. The high loadings show that variables 2, 4, 5 define factor 1, a nutritional factor. Variables 1 and 3 define factor 2 which might be referred to as a taste factor.
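For $k = 2$ the varimax angle can be found by a simple search over $\theta$. The sketch below is plain Python and uses the rounded loadings of Table 10.2; the grid search and the form of the criterion (the sum over columns of the variances of the normalized squared loadings) are assumptions made for this illustration, not the algorithm used for the published numbers.

```python
import math

# Unrotated loadings from Table 10.2 (rows = variables, columns = factors).
Q = [(0.56, 0.82), (0.78, -0.53), (0.65, 0.75), (0.94, -0.11), (0.80, -0.54)]

def rotate(Q, theta):
    """Row-wise product Q G(theta) with G = [[cos, sin], [-sin, cos]]."""
    c, s = math.cos(theta), math.sin(theta)
    return [(q1 * c - q2 * s, q1 * s + q2 * c) for q1, q2 in Q]

def varimax_criterion(Q):
    """Sum over columns of the variance of the normalized squared loadings."""
    p, k = len(Q), len(Q[0])
    total = 0.0
    for col in range(k):
        # Squared loadings divided by the communality of each variable.
        sq = [row[col] ** 2 / sum(v * v for v in row) for row in Q]
        total += sum(v * v for v in sq) - sum(sq) ** 2 / p
    return total / p

# The criterion has period pi/2 in theta, so a grid on [0, pi/2) is enough.
steps = 9000
grid = [i * (math.pi / 2) / steps for i in range(steps)]
best_theta = max(grid, key=lambda t: varimax_criterion(rotate(Q, t)))
Q_rot = rotate(Q, best_theta)

print(round(math.degrees(best_theta), 1))
for row in Q_rot:
    print(round(row[0], 2), round(row[1], 2))
```

The rotated loadings agree with those quoted in Example 10.6 up to the rounding of the input loadings; a rotation by a further multiple of 90 degrees would only permute the two factors and change signs.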


Summary

↪ In practice, $Q$ and $\Psi$ have to be estimated from $S = \hat Q \hat Q^{\top} + \hat\Psi$. The degrees of freedom are $d = \frac{1}{2}(p - k)^2 - \frac{1}{2}(p + k)$.

↪ If $d = 0$, then there exists an exact solution. In practice, $d$ is usually greater than 0, thus approximations must be considered.

↪ The maximum-likelihood method assumes a normal distribution for the data. A solution can be found using numerical algorithms.

↪ The method of principal factors is a two-stage method which calculates $\hat Q$ from the reduced correlation matrix $R - \tilde\Psi$, where $\tilde\Psi$ is a pre-estimate of $\Psi$. The final estimate of $\Psi$ is found by $\hat\psi_{jj} = 1 - \sum_{\ell=1}^{k} \hat q_{j\ell}^2$.

↪ The principal component method is based on an approximation, $\hat Q$, of $Q$.

↪ Often a more informative interpretation of the factors can be found by rotating the factors.

↪ The varimax rotation chooses a rotation $\theta$ that maximizes $\mathcal{V} = \frac{1}{p} \sum_{\ell=1}^{k} \left[ \sum_{j=1}^{p} \tilde q_{j\ell}^{\,4} - \frac{1}{p} \left\{ \sum_{j=1}^{p} \tilde q_{j\ell}^{\,2} \right\}^2 \right]$.

10.3 Factor Scores and Strategies


Up to now strategies have been presented for factor analysis that have concentrated on the estimation of loadings and communalities and on their interpretations. This was a logical step since the factors $F$ were considered to be normalized random sources of information and were explicitly addressed as nonspecific (common factors). The estimated values of the factors, called the factor scores, may also be useful in the interpretation as well as in the diagnostic analysis. To be more precise, the factor scores are estimates of the unobserved random vectors $F_\ell$, $\ell = 1, \dots, k$, for each individual $x_i$, $i = 1, \dots, n$. Johnson and Wichern (1998) describe three methods which in practice yield very similar results. Here, we present the regression method which has the advantage of being the simplest technique and is easy to implement.


The idea is to consider the joint distribution of $(X - \mu)$ and $F$, and then to proceed with the regression analysis presented in Chapter 5. Under the factor model (10.4), the joint covariance matrix of $(X - \mu)$ and $F$ is:

$$\operatorname{Var}\begin{pmatrix} X - \mu \\ F \end{pmatrix} = \begin{pmatrix} QQ^{\top} + \Psi & Q \\ Q^{\top} & I_k \end{pmatrix}. \tag{10.18}$$

Note that the upper left entry of this matrix equals $\Sigma$ and that the matrix has size $(p + k) \times (p + k)$.

Assuming joint normality, the conditional distribution of $F|X$ is multinormal, see Theorem 5.1, with

$$E(F|X = x) = Q^{\top} \Sigma^{-1} (x - \mu) \tag{10.19}$$

and using (5.7) the covariance matrix can be calculated:

$$\operatorname{Var}(F|X = x) = I_k - Q^{\top} \Sigma^{-1} Q. \tag{10.20}$$

In practice, we replace the unknown $Q$, $\Sigma$ and $\mu$ by corresponding estimators, leading to the estimated individual factor scores:

$$\hat f_i = \hat Q^{\top} S^{-1} (x_i - \bar x). \tag{10.21}$$

We prefer to use the original sample covariance matrix $S$ as an estimator of $\Sigma$, instead of the factor analysis approximation $\hat Q \hat Q^{\top} + \hat\Psi$, in order to be more robust against incorrect determination of the number of factors.


The same rule can be followed when using $R$ instead of $S$. Then (10.18) remains valid when standardized variables, i.e., $Z = D_\Sigma^{-1/2}(X - \mu)$, are considered if $D_\Sigma = \operatorname{diag}(\sigma_{11}, \dots, \sigma_{pp})$. In this case the factors are given by

$$\hat f_i = \hat Q^{\top} R^{-1} z_i, \tag{10.22}$$

where $z_i = D_S^{-1/2}(x_i - \bar x)$, $\hat Q$ is the loading obtained with the matrix $R$, and $D_S = \operatorname{diag}(s_{11}, \dots, s_{pp})$.

If the factors are rotated by the orthogonal matrix $G$, the factor scores have to be rotated accordingly, that is

$$\hat f_i^* = G^{\top} \hat f_i. \tag{10.23}$$

A practical example is presented in Section 10.4 using the Boston Housing data.
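The regression method in (10.21) and the rotation rule (10.23) can be sketched with NumPy as follows. The data are simulated from a hypothetical two-factor model, and the true loadings are plugged in where an estimate $\hat Q$ would normally be used, so the numbers are purely illustrative.

```python
import numpy as np

def regression_factor_scores(X, Q):
    """Rows are f_i^T = (x_i - xbar)^T S^{-1} Q, i.e. (10.21) for every observation."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)            # centered data
    S = Xc.T @ Xc / X.shape[0]         # empirical covariance matrix
    # solve(S, Xc^T) gives S^{-1} Xc^T without forming the inverse explicitly.
    return np.linalg.solve(S, Xc.T).T @ Q

# Simulated data from a hypothetical two-factor model (illustration only).
rng = np.random.default_rng(1)
n = 500
Q_true = np.array([[0.9, 0.0], [0.8, 0.0], [0.0, 0.9], [0.0, 0.8]])
psi_true = 1.0 - (Q_true ** 2).sum(axis=1)
F = rng.normal(size=(n, 2))
U = rng.normal(size=(n, 4)) * np.sqrt(psi_true)
X = F @ Q_true.T + U + np.array([1.0, 2.0, 3.0, 4.0])

scores = regression_factor_scores(X, Q_true)

# Rotating the factors by G rotates the scores: each row becomes (G^T f_i)^T.
theta = 0.3
G = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])
rotated_scores = scores @ G

print(scores.shape)
print(np.round(scores.mean(axis=0), 10))
```

The scores are centered by construction, and computing them directly from the rotated loadings `Q_true @ G` gives the same result as rotating the original scores.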


Practical Suggestions 


No one method outperforms another in the practical implementation of factor analysis. However, by applying a tâtonnement (trial-and-error) process, the factor analysis view of the data can be stabilized. This motivates the following procedure.


1. Fix a reasonable number of factors, say k = 2 or 3, based on the correlation structure 
of the data and/or screeplot of eigenvalues. 


2. Perform several of the presented methods, including rotation. Compare the loadings, 
communalities, and factor scores from the respective results. 


3. If the results show significant deviations, check for outliers (based on factor scores), 
and consider changing the number of factors k. 


For larger data sets, cross-validation methods are recommended. Such methods involve 
splitting the sample into a training set and a validation data set. On the training sample one 
estimates the factor model with the desired methodology and uses the obtained parameters 
to predict the factor scores for the validation data set. The predicted factor scores should be 
comparable to the factor scores obtained using only the validation data set. This stability 
criterion may also involve the loadings and communalities. 


Factor Analysis versus PCA 


Factor analysis and principal component analysis use the same set of mathematical tools (spectral decomposition, projections, ...). One could conclude, at first sight, that they share the same view and strategy and therefore yield very similar results. This is not true. There are substantial differences between these two data analysis techniques that we would like to describe here.


The biggest difference between PCA and factor analysis comes from the model philosophy. Factor analysis imposes a strict structure of a fixed number of common (latent) factors whereas the PCA determines $p$ factors in decreasing order of importance. The most important factor in PCA is the one that maximizes the projected variance. The most important factor in factor analysis is the one that (after rotation) gives the maximal interpretation. Often this is different from the direction of the first principal component.

                            Estimated factor                          Specific
                                loadings          Communalities      variances
                       $\hat q_1$  $\hat q_2$  $\hat q_3$   $\hat h_j^2$   $\hat\psi_{jj} = 1 - \hat h_j^2$
 1 crime              0.9295  -0.1653   0.1107     0.9036          0.0964
 2 large lots        -0.5823  -0.0379   0.2902     0.4248          0.5752
 3 nonretail acres    0.8192   0.0296  -0.1378     0.6909          0.3091
 5 nitric oxides      0.8789  -0.0987  -0.2719     0.8561          0.1439
 6 rooms             -0.4447  -0.5311  -0.0380     0.4812          0.5188
 7 prior 1940         0.7837   0.0149  -0.3554     0.7406          0.2594
 8 empl. centers     -0.8294   0.1570   0.4110     0.8816          0.1184
 9 accessibility      0.7955  -0.3062   0.4053     0.8908          0.1092
10 tax-rate           0.8262  -0.1401   0.2906     0.7867          0.2133
11 pupil/teacher      0.5051   0.1850   0.1553     0.3135          0.6865
12 blacks            -0.4701  -0.0227  -0.1627     0.2480          0.7520
13 lower status       0.7601   0.5059  -0.0070     0.8337          0.1663
14 value             -0.6942  -0.5904  -0.1798     0.8628          0.1371

Table 10.4. Estimated factor loadings, communalities, and specific variances, MLM. MVAfacthous.xpl


From an implementation point of view, the PCA is based on a well-defined, unique algorithm 
(spectral decomposition), whereas fitting a factor analysis model involves a variety of 
numerical procedures. The non-uniqueness of the factor analysis procedure opens the door 
for subjective interpretation and therefore yields a spectrum of results. This data analysis 
philosophy makes factor analysis difficult, especially if the model specification involves 
cross-validation and a data-driven selection of the number of factors. 


10.4 Boston Housing 


To illustrate how to implement factor analysis we will use the Boston housing data set and 
the by now well-known set of transformations. Once again, the variable $X_4$ (Charles River 
indicator) will be excluded. As before, standardized variables are used and the analysis is 
based on the correlation matrix. 


In Section 10.3, we described a practical implementation of factor analysis. Based on principal 
components, three factors were chosen and factor analysis was applied using the maximum 
likelihood method (MLM), the principal factor method (PFM), and the principal 
component method (PCM). For illustration, the MLM will be presented with and without 
varimax rotation. 


Figure 10.2. Factor analysis for Boston housing data, MLM. 
MVAfacthous.xpl 


                        Estimated factor loadings          Communalities   Specific variances
                       $\hat{q}_1$  $\hat{q}_2$  $\hat{q}_3$   $\hat{h}_j^2$   $\hat{\psi}_{jj} = 1 - \hat{h}_j^2$

 1  crime                0.8413   -0.0940   -0.4324    0.9036    0.0964
 2  large lots          -0.3326   -0.1323    0.5447    0.4248    0.5752
 3  nonretail acres      0.6142    0.1238   -0.5462    0.6909    0.3091
 5  nitric oxides        0.5917    0.0221   -0.7110    0.8561    0.1439
 6  rooms               -0.3950   -0.5585    0.1153    0.4812    0.5188
 7  prior 1940           0.4665    0.1374   -0.7100    0.7406    0.2594
 8  empl. centers       -0.4747    0.0198    0.8098    0.8816    0.1184
 9  accessibility        0.8879   -0.2874   -0.1409    0.8908    0.1092
10  tax-rate             0.8518   -0.1044   -0.2240    0.7867    0.2133
11  pupil/teacher        0.5090    0.2061   -0.1093    0.3135    0.6865
12  blacks              -0.4834   -0.0418    0.1122    0.2480    0.7520
13  lower status         0.6358    0.5690   -0.3252    0.8337    0.1663
14  value               -0.6817   -0.6193    0.1208    0.8628    0.1371


Table 10.5. Estimated factor loadings, communalities, and specific variances, 
MLM, varimax rotation. MVAfacthous.xpl 


Figure 10.3. Factor analysis for Boston housing data, MLM after varimax 
rotation. MVAfacthous.xpl 


Table 10.4 gives the MLM factor loadings without rotation and Table 10.5 gives the varimax 
version of this analysis. The corresponding graphical representations of the loadings are 
displayed in Figures 10.2 and 10.3. We can see that the varimax does not significantly 
change the interpretation of the factors obtained by the MLM. Factor 1 can be roughly 
interpreted as a “quality of life factor” because it is positively correlated with variables like 
Xj , and negatively correlated with Xg, both having low specific variances. The second factor 
may be interpreted as a “residential factor”, since it is highly correlated with variables $X_6$ 
and $X_{13}$. The most striking difference between the results with and without varimax rotation 
can be seen by comparing the lower left corners of Figures 10.2 and 10.3. There is a clear 
separation of the variables in the varimax version of the MLM. Given this arrangement of 
the variables in Figure 10.3, we can interpret factor 3 as an employment factor, since we 
observe high correlations with $X_8$ and $X_5$. 
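

The communalities and specific variances reported in Tables 10.4 and 10.5 can be checked 
directly from the loadings. The following Python sketch is only an illustration: the language 
and the helper names `communality` and `specific_variance` are assumptions and are not part 
of the original quantlet. It recomputes the values for three variables and suggests that the 
varimax rotation leaves the communalities unchanged, up to rounding of the printed loadings. 

```python
# Loadings copied from Table 10.4 (MLM, unrotated) and Table 10.5 (MLM, varimax).
unrotated = {
    "crime": (0.9295, -0.1653, 0.1107),
    "empl. centers": (-0.8294, 0.1570, 0.4110),
    "value": (-0.6942, -0.5904, -0.1798),
}
varimax = {
    "crime": (0.8413, -0.0940, -0.4324),
    "empl. centers": (-0.4747, 0.0198, 0.8098),
    "value": (-0.6817, -0.6193, 0.1208),
}


def communality(loadings):
    """Sum of squared loadings of one variable over all factors."""
    return sum(q * q for q in loadings)


def specific_variance(loadings):
    """Specific variance of a standardized variable: 1 minus its communality."""
    return 1.0 - communality(loadings)


for name in unrotated:
    h_unrot = communality(unrotated[name])
    h_rot = communality(varimax[name])
    print(f"{name:15s} h2 unrotated={h_unrot:.4f} h2 varimax={h_rot:.4f} "
          f"psi={specific_variance(varimax[name]):.4f}")
```

Because a rotation is an orthogonal transformation of the loadings, the row sums of squares 
agree between the two tables; small differences in the fourth decimal come from the rounded 
inputs. 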


We now turn to the PCM and PFM analyses. The results are presented in Tables 10.6 
and 10.7 and in Figures 10.4 and 10.5. We would like to focus on the PCM, because this 
3-factor model yields only one specific variance (unexplained variation) above 0.5. Looking 
at Figure 10.4, it turns out that factor 1 remains a “quality of life factor” which is clearly 
visible from the clustering of $X_5$, $X_3$, $X_{10}$ and $X_1$ on the right-hand side of the graph, while 
the variables $X_8$, $X_2$, $X_{14}$, $X_{12}$ and $X_6$ are on the left-hand side. Again, the second factor 
is a “residential factor”, clearly demonstrated by the location of variables $X_6$, $X_{14}$, $X_{11}$, 
and $X_{13}$. The interpretation of the third factor is more difficult because all of the loadings 
(except for $X_{12}$) are very small. 


                        Estimated factor loadings          Communalities   Specific variances
                       $\hat{q}_1$  $\hat{q}_2$  $\hat{q}_3$   $\hat{h}_j^2$   $\hat{\psi}_{jj} = 1 - \hat{h}_j^2$

 1  crime                0.9164    0.0152    0.2357    0.8955    0.1045
 2  large lots          -0.6772    0.0762    0.4490    0.6661    0.3339
 3  nonretail acres      0.8614   -0.1321   -0.1115    0.7719    0.2281
 5  nitric oxides        0.9172    0.0573   -0.0874    0.8521    0.1479
 6  rooms               -0.3590    0.7896    0.1040    0.7632    0.2368
 7  prior 1940           0.8392   -0.0008   -0.2163    0.7510    0.2490
 8  empl. centers       -0.8928   -0.1253    0.2064    0.8554    0.1446
 9  accessibility        0.7562    0.0927    0.4616    0.7935    0.2065
10  tax-rate             0.7891   -0.0370    0.4430    0.8203    0.1797
11  pupil/teacher        0.4827   -0.3911    0.1719    0.4155    0.5845
12  blacks              -0.4499    0.0368   -0.5612    0.5188    0.4812
13  lower status         0.6925   -0.5843    0.0035    0.8209    0.1791
14  value               -0.5933    0.6720   -0.1895    0.8394    0.1606


Table 10.6. Estimated factor loadings, communalities, and specific variances, 
PCM, varimax rotation. MVAfacthous.xpl 


Figure 10.4. Factor analysis for Boston housing data, PCM after varimax 
rotation. MVAfacthous.xpl 


                        Estimated factor loadings          Communalities   Specific variances
                       $\hat{q}_1$  $\hat{q}_2$  $\hat{q}_3$   $\hat{h}_j^2$   $\hat{\psi}_{jj} = 1 - \hat{h}_j^2$

 1  crime                0.8579   -0.0270   -0.4175    0.9111    0.0889
 2  large lots          -0.2953    0.2168    0.5756    0.4655    0.5345
 3  nonretail acres      0.5893   -0.2415   -0.5666    0.7266    0.2734
 5  nitric oxides        0.6050   -0.0892   -0.6855    0.8439    0.1561
 6  rooms               -0.2902    0.6280    0.1296    0.4954    0.5046
 7  prior 1940           0.4702   -0.1741   -0.6733    0.7049    0.2951
 8  empl. centers       -0.4988    0.0414    0.7876    0.8708    0.1292
 9  accessibility        0.8830    0.1187   -0.1479    0.8156    0.1844
10  tax-rate             0.8969   -0.0136   -0.1666    0.8325    0.1675
11  pupil/teacher        0.4590   -0.2798   -0.1412    0.3090    0.6910
12  blacks              -0.4812    0.0666    0.0856    0.2433    0.7567
13  lower status         0.5433   -0.6604   -0.3193    0.8333    0.1667
14  value               -0.6012    0.7004    0.0956    0.8611    0.1389


Table 10.7. Estimated factor loadings, communalities, and specific variances, 
PFM, varimax rotation. MVAfacthous.xpl 


Figure 10.5. Factor analysis for Boston housing data, PFM after varimax 
rotation. MVAfacthous.xpl 


10.5 Exercises 


EXERCISE 10.1 In Example 10.4 we have computed $\hat{\mathcal{Q}}$ and $\hat{\Psi}$ using the method of principal 
factors. We used a two-step iteration for $\hat{\Psi}$. Perform the third iteration step and compare 
the results (i.e., use the given $\hat{\mathcal{Q}}$ as a pre-estimate to find the final $\hat{\Psi}$). 


EXERCISE 10.2 Using the bank data set, how many factors can you find with the Method 
of Principal Factors? 


EXERCISE 10.3 Repeat Exercise 10.2 with the U.S. company data set! 


EXERCISE 10.4 Generalize the two-dimensional rotation matrix in Section 10.2 to n-dimensional 
space. 


EXERCISE 10.5 Compute the orthogonal factor model for 

$$\Sigma = \begin{pmatrix} 1 & 0.9 & 0.7 \\ 0.9 & 1 & 0.4 \\ 0.7 & 0.4 & 1 \end{pmatrix}.$$

(Solution: $\psi_{11} = -0.575$, $q_{11} = 1.255$) 


EXERCISE 10.6 Perform a factor analysis on the type of families in the French food data 
set. Rotate the resulting factors in a way which provides the most reasonable interpretation. 
Compare your result with the varimax method. 


EXERCISE 10.7 Perform a factor analysis on the variables X3 to X9 in the U.S. crime 
data set (Table B.10). Would it make sense to use all of the variables for the analysis? 


EXERCISE 10.8 Analyze the athletic records data set (Table B.18). Can you recognize any 
patterns if you sort the countries according to the estimates of the factor scores? 


EXERCISE 10.9 Perform a factor analysis on the U.S. health data set (Table B.16) and 
estimate the factor scores. 
EXERCISE 10.10 Redo Exercise 10.9 using the U.S. crime data in Table B.10. Compare 
the estimated factor scores of the two data sets. 


EXERCISE 10.11 Analyze the vocabulary data given in Table B.17. 


11 Cluster Analysis 


The next two chapters address classification issues from two varying perspectives. When 
considering groups of objects in a multivariate data set, two situations can arise. Given a 
data set containing measurements on individuals, in some cases we want to see if some natural 
groups or classes of individuals exist, and in other cases, we want to classify the individuals 
according to a set of existing groups. Cluster analysis develops tools and methods concerning 
the former case, that is, given a data matrix containing multivariate measurements on a 
large number of individuals (or objects), the objective is to build some natural subgroups 
or clusters of individuals. This is done by grouping individuals that are “similar” according 
to some appropriate criterion. Once the clusters are obtained, it is generally useful to 
describe each group using some descriptive tool from Chapters 1, 8 or 9 to create a better 
understanding of the differences that exist among the formulated groups. 


Cluster analysis is applied in many fields such as the natural sciences, the medical sciences, 
economics, marketing, etc. In marketing, for instance, it is useful to build and describe the 
different segments of a market from a survey on potential consumers. An insurance company, 
on the other hand, might be interested in the distinction among classes of potential customers 
so that it can derive optimal prices for its services. Other examples are provided below. 


Discriminant analysis presented in Chapter 12 addresses the other issue of classification. 
It focuses on situations where the different groups are known a priori. Decision rules are 
provided in classifying a multivariate observation into one of the known groups. 


Section 11.1 states the problem of cluster analysis where the criterion chosen to measure the 
similarity among objects clearly plays an important role. Section 11.2 shows how to precisely 
measure the proximity between objects. Finally, Section 11.3 provides some algorithms. We 
will concentrate on hierarchical algorithms only where the number of clusters is not known 
in advance. 


11.1 The Problem 


Cluster analysis is a set of tools for building groups (clusters) from multivariate data objects. 
The aim is to construct groups with homogeneous properties out of heterogeneous large 
samples. The groups or clusters should be as homogeneous as possible and the differences 
among the various groups as large as possible. Cluster analysis can be divided into two 
fundamental steps. 


1. Choice of a proximity measure: 
One checks each pair of observations (objects) for the similarity of their values. A 
similarity (proximity) measure is defined to measure the “closeness” of the objects. 
The “closer” they are, the more homogeneous they are. 


2. Choice of group-building algorithm: 
On the basis of the proximity measures the objects are assigned to groups so that differences 
between groups become large and observations in a group become as close as possible. 


In marketing, for example, cluster analysis is used to select test markets. Other applications 
include the classification of companies according to their organizational structures, technologies 
and types. In psychology, cluster analysis is used to find types of personalities on the 
basis of questionnaires. In archaeology, it is applied to classify art objects in different time 
periods. Other scientific branches that use cluster analysis are medicine, sociology, linguistics 
and biology. In each case a heterogeneous sample of objects is analyzed with the aim 
of identifying homogeneous subgroups. 


Summary 


↪ Cluster analysis is a set of tools for building groups (clusters) from multivariate 
data objects. 


↪ The methods used are usually divided into two fundamental steps: the 
choice of a proximity measure and the choice of a group-building algorithm. 


11.2 The Proximity between Objects 


The starting point of a cluster analysis is a data matrix $\mathcal{X}(n \times p)$ with $n$ measurements 
(objects) of $p$ variables. The proximity (similarity) among objects is described by a matrix 
$\mathcal{D}(n \times n)$ 

$$\mathcal{D} = \begin{pmatrix} d_{11} & d_{12} & \cdots & d_{1n} \\ d_{21} & d_{22} & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & d_{nn} \end{pmatrix}. \tag{11.1}$$


The matrix $\mathcal{D}$ contains measures of similarity or dissimilarity among the $n$ objects. If the 
values $d_{ij}$ are distances, then they measure dissimilarity. The greater the distance, the less 
similar are the objects. If the values $d_{ij}$ are proximity measures, then the opposite is true, 
i.e., the greater the proximity value, the more similar are the objects. A distance matrix, 
for example, could be defined by the $L_2$-norm: $d_{ij} = \|x_i - x_j\|_2$, where $x_i$ and $x_j$ denote the 
rows of the data matrix $\mathcal{X}$. Distance and similarity are of course dual. If $d_{ij}$ is a distance, 
then $d'_{ij} = \max_{i,j}\{d_{ij}\} - d_{ij}$ is a proximity measure. 


The nature of the observations plays an important role in the choice of proximity measure. 
Nominal values (like binary variables) lead in general to proximity values, whereas metric 
values lead (in general) to distance matrices. We first present possibilities for D in the binary 
case and then consider the continuous case. 


Similarity of objects with binary structure 


In order to measure the similarity between objects we always compare pairs of observations 
$(x_i, x_j)$ where $x_i^\top = (x_{i1}, \ldots, x_{ip})$, $x_j^\top = (x_{j1}, \ldots, x_{jp})$, and $x_{ik}, x_{jk} \in \{0, 1\}$. Obviously there 
are four cases: 

$$x_{ik} = x_{jk} = 1, \qquad x_{ik} = 0,\ x_{jk} = 1, \qquad x_{ik} = 1,\ x_{jk} = 0, \qquad x_{ik} = x_{jk} = 0.$$

Define 

$$a_1 = \sum_{k=1}^{p} I(x_{ik} = x_{jk} = 1),$$
$$a_2 = \sum_{k=1}^{p} I(x_{ik} = 0,\ x_{jk} = 1),$$
$$a_3 = \sum_{k=1}^{p} I(x_{ik} = 1,\ x_{jk} = 0),$$
$$a_4 = \sum_{k=1}^{p} I(x_{ik} = x_{jk} = 0).$$

Note that each $a_\ell$, $\ell = 1, \ldots, 4$, depends on the pair $(x_i, x_j)$. 
The following proximity measures are used in practice: 

$$d_{ij} = \frac{a_1 + \delta a_4}{a_1 + \delta a_4 + \lambda (a_2 + a_3)} \tag{11.2}$$

where $\delta$ and $\lambda$ are weighting factors. Table 11.2 shows some similarity measures for given 
weighting factors. 


Name                    $\delta$   $\lambda$   Definition

Jaccard                 0     1     $a_1 / (a_1 + a_2 + a_3)$
Tanimoto                1     2     $(a_1 + a_4) / (a_1 + 2(a_2 + a_3) + a_4)$
Simple Matching (M)     1     1     $(a_1 + a_4) / p$
Russel and Rao (RR)     -     -     $a_1 / p$
Dice                    0     0.5   $2a_1 / (2a_1 + (a_2 + a_3))$
Kulczynski              -     -     $a_1 / (a_2 + a_3)$


Table 11.2. The common similarity coefficients. 


These measures provide alternative ways of weighting mismatchings and positive (presence of 
a common character) or negative (absence of a common character) matchings. In principle, 
we could also consider the Euclidean distance. However, the disadvantage of this distance is 
that it treats the observations 0 and 1 in the same way. If $x_{ik} = 1$ denotes, say, knowledge of 
a certain language, then the contrary, $x_{ik} = 0$ (not knowing the language), should possibly 
be treated differently. 


EXAMPLE 11.1 Let us consider binary variables computed from the car data set (Table 
B.7). We define the new binary data by 

$$y_{ik} = \begin{cases} 1 & \text{if } x_{ik} > \bar{x}_k, \\ 0 & \text{otherwise,} \end{cases}$$

for $i = 1, \ldots, n$ and $k = 1, \ldots, p$. This means that we transform the observations of the k-th 
variable to 1 if it is larger than the mean value of all observations of the k-th variable. Let 
us only consider the data points 17 to 19 (Renault 19, Rover and Toyota Corolla) which lead 
to $(3 \times 3)$ distance matrices. The Jaccard measure gives the similarity matrix 

$$\mathcal{D} = \begin{pmatrix} 1.000 & 0.000 & 0.333 \\ & 1.000 & 0.250 \\ & & 1.000 \end{pmatrix},$$

the Tanimoto measure yields 

$$\mathcal{D} = \begin{pmatrix} 1.000 & 0.231 & 0.600 \\ & 1.000 & 0.455 \\ & & 1.000 \end{pmatrix},$$

whereas the Simple Matching measure gives 

$$\mathcal{D} = \begin{pmatrix} 1.000 & 0.375 & 0.750 \\ & 1.000 & 0.625 \\ & & 1.000 \end{pmatrix}.$$

Distance measures for continuous variables 


A wide variety of distance measures can be generated by the $L_r$-norms, $r \geq 1$, 

$$d_{ij} = \|x_i - x_j\|_r = \left\{ \sum_{k=1}^{p} |x_{ik} - x_{jk}|^r \right\}^{1/r}. \tag{11.3}$$

Here $x_{ik}$ denotes the value of the k-th variable on object $i$. It is clear that $d_{ii} = 0$ for 
$i = 1, \ldots, n$. The class of distances (11.3) for varying $r$ measures the dissimilarity with 
different weights. The $L_1$-metric, for example, gives less weight to outliers than the $L_2$-norm 
(Euclidean norm). It is common to consider the squared $L_2$-norm. 
EXAMPLE 11.2 Suppose we have $x_1 = (0,0)$, $x_2 = (1,0)$ and $x_3 = (5,5)$. Then the distance 
matrix for the $L_1$-norm is 

$$\mathcal{D}_1 = \begin{pmatrix} 0 & 1 & 10 \\ 1 & 0 & 9 \\ 10 & 9 & 0 \end{pmatrix},$$

and for the squared $L_2$- or Euclidean norm 

$$\mathcal{D}_2 = \begin{pmatrix} 0 & 1 & 50 \\ 1 & 0 & 41 \\ 50 & 41 & 0 \end{pmatrix}.$$

One can see that the third observation $x_3$ receives much more weight in the squared $L_2$-norm 
than in the $L_1$-norm. 
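

The two matrices of Example 11.2 can be reproduced with a few lines of Python. This is a 
minimal sketch; the helper names `lr_distance`, `squared_l2` and `distance_matrix` are 
illustrative assumptions. 

```python
def lr_distance(x, y, r):
    """L_r distance between two points, as in (11.3)."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


def squared_l2(x, y):
    """Squared Euclidean distance, computed without taking the root."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def distance_matrix(points, dist):
    """Full symmetric matrix of pairwise distances."""
    return [[dist(p, q) for q in points] for p in points]


points = [(0, 0), (1, 0), (5, 5)]
D1 = distance_matrix(points, lambda p, q: lr_distance(p, q, 1))
D2 = distance_matrix(points, squared_l2)

for row in D1:
    print(row)
for row in D2:
    print(row)
```

Comparing the last column of both matrices shows how squaring inflates the contribution 
of the distant point $x_3$. 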


An underlying assumption in applying distances based on $L_r$-norms is that the variables are 
measured on the same scale. If this is not the case, a standardization should first be applied. 
This corresponds to using a more general $L_2$- or Euclidean norm with a metric $A$, where 
$A > 0$ (see Section 2.6): 

$$d_{ij}^2 = \|x_i - x_j\|_A = (x_i - x_j)^\top A (x_i - x_j). \tag{11.4}$$

$L_2$-norms are given by $A = \mathcal{I}_p$, but if a standardization is desired, then the weight matrix 
$A = \mathrm{diag}(s_{X_1X_1}^{-1}, \ldots, s_{X_pX_p}^{-1})$ may be suitable. Recall that $s_{X_kX_k}$ is the variance of the k-th 
component. Hence we have 

$$d_{ij}^2 = \sum_{k=1}^{p} \frac{(x_{ik} - x_{jk})^2}{s_{X_kX_k}}. \tag{11.5}$$

Here each component has the same weight in the computation of the distances and the 
distances do not depend on a particular choice of the units of measure. 
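

The invariance to the units of measure can be illustrated numerically. In the following 
Python sketch the data are hypothetical (heights and weights invented for the illustration), 
and the variance is the sample variance from the standard library, which divides by n - 1; 
the text may use a different normalization, but the invariance holds either way. 

```python
from statistics import variance


def standardized_sq_distance(data, i, j):
    """Squared distance (11.5): squared differences divided by the variances."""
    p = len(data[0])
    variances = [variance([row[k] for row in data]) for k in range(p)]
    return sum((data[i][k] - data[j][k]) ** 2 / variances[k] for k in range(p))


def raw_sq_distance(data, i, j):
    """Plain squared Euclidean distance, for comparison."""
    return sum((a - b) ** 2 for a, b in zip(data[i], data[j]))


# Hypothetical observations: (height in metres, weight in kilograms).
data_m = [(1.60, 55.0), (1.75, 72.0), (1.82, 90.0), (1.68, 61.0)]
# The same observations with height expressed in centimetres.
data_cm = [(100 * h, w) for h, w in data_m]

print(raw_sq_distance(data_m, 0, 1), raw_sq_distance(data_cm, 0, 1))
print(standardized_sq_distance(data_m, 0, 1), standardized_sq_distance(data_cm, 0, 1))
```

The raw distances differ strongly between the two unit choices, whereas the standardized 
distances agree up to floating-point rounding. 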


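EXAMPLE 11.3 Consider the French Food expenditures (Table B.6). The Euclidean distance 
matrix (squared $L_2$-norm) is 

$$\mathcal{D} = 10^4 \cdot \left( \begin{array}{rrrrrrrrrrrr}
0.00 & 5.82 & 58.19 & 3.54 & 5.15 & 151.44 & 16.91 & 36.15 & 147.99 & 51.84 & 102.56 & 271.83 \\
 & 0.00 & 41.73 & 4.53 & 2.93 & 120.59 & 13.52 & 25.39 & 116.31 & 43.68 & 76.81 & 226.87 \\
 & & 0.00 & 44.14 & 40.10 & 24.12 & 29.95 & 8.17 & 25.57 & 20.81 & 20.30 & 88.62 \\
 & & & 0.00 & 0.76 & 127.85 & 5.62 & 21.70 & 124.98 & 31.21 & 72.97 & 231.57 \\
 & & & & 0.00 & 121.05 & 5.70 & 19.85 & 118.77 & 30.82 & 67.39 & 220.72 \\
 & & & & & 0.00 & 96.57 & 48.16 & 1.80 & 60.52 & 28.90 & 29.56 \\
 & & & & & & 0.00 & 9.20 & 94.87 & 11.07 & 42.12 & 179.84 \\
 & & & & & & & 0.00 & 46.95 & 6.17 & 18.76 & 113.03 \\
 & & & & & & & & 0.00 & 61.08 & 29.62 & 31.86 \\
 & & & & & & & & & 0.00 & 15.83 & 116.11 \\
 & & & & & & & & & & 0.00 & 53.77 \\
 & & & & & & & & & & & 0.00
\end{array} \right).$$

Taking the weight matrix $A = \mathrm{diag}(s_{X_1X_1}^{-1}, \ldots, s_{X_7X_7}^{-1})$, we obtain the distance matrix (squared 
$L_2$-norm) 

$$\mathcal{D} = \left( \begin{array}{rrrrrrrrrrrr}
0.00 & 6.85 & 10.04 & 1.68 & 2.66 & 24.90 & 8.28 & 8.56 & 24.61 & 21.55 & 30.68 & 57.48 \\
 & 0.00 & 13.11 & 6.59 & 3.75 & 20.12 & 13.13 & 12.38 & 15.88 & 31.52 & 25.65 & 46.64 \\
 & & 0.00 & 8.03 & 7.27 & 4.99 & 9.27 & 3.88 & 7.46 & 14.92 & 15.08 & 26.89 \\
 & & & 0.00 & 0.64 & 20.06 & 2.76 & 3.82 & 19.63 & 12.81 & 19.28 & 45.01 \\
 & & & & 0.00 & 17.00 & 3.54 & 3.81 & 15.76 & 14.98 & 16.89 & 39.87 \\
 & & & & & 0.00 & 17.51 & 9.79 & 1.58 & 21.32 & 11.36 & 13.40 \\
 & & & & & & 0.00 & 1.80 & 17.92 & 4.39 & 9.93 & 33.61 \\
 & & & & & & & 0.00 & 10.50 & 5.70 & 7.97 & 24.41 \\
 & & & & & & & & 0.00 & 24.75 & 11.02 & 13.07 \\
 & & & & & & & & & 0.00 & 9.13 & 29.78 \\
 & & & & & & & & & & 0.00 & 9.39 \\
 & & & & & & & & & & & 0.00
\end{array} \right). \tag{11.6}$$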


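When applied to contingency tables, a $\chi^2$-metric is suitable to compare (and cluster) rows 
and columns of a contingency table. 
If $\mathcal{X}$ is a contingency table, row $i$ is characterized by the conditional frequency distribution 
$x_{ij}/x_{i\bullet}$, where $x_{i\bullet} = \sum_{j=1}^{p} x_{ij}$ indicates the marginal distributions over the rows: $x_{i\bullet}/x_{\bullet\bullet}$, 
$x_{\bullet\bullet} = \sum_{i=1}^{n} x_{i\bullet}$. Similarly, column $j$ of $\mathcal{X}$ is characterized by the conditional frequencies 
$x_{ij}/x_{\bullet j}$, where $x_{\bullet j} = \sum_{i=1}^{n} x_{ij}$. The marginal frequencies of the columns are $x_{\bullet j}/x_{\bullet\bullet}$. 


The distance between two rows, $i_1$ and $i_2$, corresponds to the distance between their respective 
frequency distributions. It is common to define this distance using the $\chi^2$-metric: 

$$d^2(i_1, i_2) = \sum_{j=1}^{p} \frac{1}{\left( x_{\bullet j}/x_{\bullet\bullet} \right)} \left( \frac{x_{i_1 j}}{x_{i_1 \bullet}} - \frac{x_{i_2 j}}{x_{i_2 \bullet}} \right)^2. \tag{11.7}$$

Note that this can be expressed as a distance between the vectors of conditional frequencies 
$(x_{i_1 j}/x_{i_1 \bullet})_{j=1,\ldots,p}$ and $(x_{i_2 j}/x_{i_2 \bullet})_{j=1,\ldots,p}$ as in (11.4) with weighting matrix 
$A = \left\{ \mathrm{diag}\left( x_{\bullet j}/x_{\bullet\bullet} \right) \right\}^{-1}$. Similarly, if we are interested 
in clusters among the columns, we can define: 

$$d^2(j_1, j_2) = \sum_{i=1}^{n} \frac{1}{\left( x_{i \bullet}/x_{\bullet\bullet} \right)} \left( \frac{x_{i j_1}}{x_{\bullet j_1}} - \frac{x_{i j_2}}{x_{\bullet j_2}} \right)^2.$$


Apart from the Euclidean and the $L_r$-norm measures one can use a proximity measure such 
as the Q-correlation coefficient 

$$d_{ij} = \frac{\sum_{k=1}^{p} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\left\{ \sum_{k=1}^{p} (x_{ik} - \bar{x}_i)^2 \sum_{k=1}^{p} (x_{jk} - \bar{x}_j)^2 \right\}^{1/2}}.$$

Here $\bar{x}_i$ denotes the mean over the variables $(x_{i1}, \ldots, x_{ip})$. 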
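

As an illustration of the $\chi^2$-metric between rows of a contingency table, the following 
Python sketch computes the squared distance (11.7) for a small table. The counts are 
invented for the example and the function name `chi2_row_distance` is an assumption. 

```python
def chi2_row_distance(table, i1, i2):
    """Squared chi-square distance between the profiles of rows i1 and i2."""
    n_cols = len(table[0])
    total = sum(sum(row) for row in table)
    # Marginal frequencies of the columns.
    col_marginal = [sum(row[j] for row in table) / total for j in range(n_cols)]
    r1, r2 = sum(table[i1]), sum(table[i2])
    # Differences of conditional frequencies, weighted by the inverse column marginals.
    return sum(
        (table[i1][j] / r1 - table[i2][j] / r2) ** 2 / col_marginal[j]
        for j in range(n_cols)
    )


# Hypothetical contingency table: rows 0 and 1 have proportional counts.
table = [
    [10, 20, 30],
    [20, 40, 60],
    [30, 10, 20],
]

print(chi2_row_distance(table, 0, 1))
print(chi2_row_distance(table, 0, 2))
```

Rows with proportional counts have identical conditional distributions, so their distance 
is zero regardless of the row totals; this is the property that makes the metric suitable for 
comparing profiles rather than raw counts. 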
Summary 


↪ The proximity between data points is measured by a distance or similarity 
matrix $\mathcal{D}$ whose components $d_{ij}$ give the similarity coefficient or the 
distance between two points $x_i$ and $x_j$. 


↪ A variety of similarity (distance) measures exist for binary data (e.g., 
Jaccard, Tanimoto, Simple Matching coefficients) and for continuous data 
(e.g., $L_r$-norms). 


↪ The nature of the data could impose the choice of a particular metric $A$ 
in defining the distances (standardization, $\chi^2$-metric etc.). 


11.3 Cluster Algorithms 


There are essentially two types of clustering methods: hierarchical algorithms and partitioning 
algorithms. The hierarchical algorithms can be divided into agglomerative and splitting 
procedures. The first type of hierarchical clustering starts from the finest partition possible 
(each observation forms a cluster) and groups them. The second type starts with the coarsest 
partition possible: one cluster contains all of the observations. It proceeds by splitting the 
single cluster up into smaller sized clusters. 


The partitioning algorithms start from a given group definition and proceed by exchanging 
elements between groups until a certain score is optimized. The main difference between 
the two clustering techniques is that in hierarchical clustering once groups are found and 
elements are assigned to the groups, this assignment cannot be changed. In partitioning 
techniques, on the other hand, the assignment of objects into groups may change during the 
algorithm application. 


Hierarchical Algorithms, Agglomerative Techniques 


Agglomerative algorithms are used quite frequently in practice. The algorithm consists of 
the following steps: 


Agglomerative Algorithm 


1. Construct the finest partition. 


2. Compute the distance matrix $\mathcal{D}$. 


DO 


3. Find the two clusters with the closest distance. 


4. Put those two clusters into one cluster. 


5. Compute the distance between the new groups and obtain a reduced distance 
matrix $\mathcal{D}$. 


UNTIL all clusters are agglomerated into $\mathcal{X}$. 


If two objects or groups, say $P$ and $Q$, are united, one computes the distance between this 
new group (object) $P + Q$ and group $R$ using the following distance function: 

$$d(R, P+Q) = \delta_1 d(R,P) + \delta_2 d(R,Q) + \delta_3 d(P,Q) + \delta_4 |d(R,P) - d(R,Q)|. \tag{11.9}$$

The $\delta_j$'s are weighting factors that lead to different agglomerative algorithms as described 
in Table 11.4. Here $n_P = \sum_{i=1}^{n} I(x_i \in P)$ is the number of objects in group $P$. The values 
of $n_Q$ and $n_R$ are defined analogously. 


Name                            $\delta_1$                       $\delta_2$                       $\delta_3$                    $\delta_4$

Single linkage                  1/2                              1/2                              0                             -1/2
Complete linkage                1/2                              1/2                              0                             1/2
Average linkage (unweighted)    1/2                              1/2                              0                             0
Average linkage (weighted)      $n_P/(n_P+n_Q)$                  $n_Q/(n_P+n_Q)$                  0                             0
Centroid                        $n_P/(n_P+n_Q)$                  $n_Q/(n_P+n_Q)$                  $-n_P n_Q/(n_P+n_Q)^2$        0
Median                          1/2                              1/2                              -1/4                          0
Ward                            $(n_R+n_P)/(n_R+n_P+n_Q)$        $(n_R+n_Q)/(n_R+n_P+n_Q)$        $-n_R/(n_R+n_P+n_Q)$          0


Table 11.4. Computations of group distances. 
EXAMPLE 11.4 Let us examine the agglomerative algorithm for the three points in Example 
11.2, $x_1 = (0,0)$, $x_2 = (1,0)$ and $x_3 = (5,5)$, and the squared Euclidean distance 
matrix with single linkage weighting. The algorithm starts with $N = 3$ clusters: $P = \{x_1\}$, 
$Q = \{x_2\}$ and $R = \{x_3\}$. The distance matrix $\mathcal{D}_2$ is given in Example 11.2. The smallest 
distance in $\mathcal{D}_2$ is the one between the clusters $P$ and $Q$. Therefore, applying step 4 in the 
above algorithm we combine these clusters to form $P + Q = \{x_1, x_2\}$. The single linkage 
distance between the remaining two clusters is from Table 11.4 and (11.9) equal to 

$$\begin{aligned}
d(R, P+Q) &= \tfrac{1}{2} d(R,P) + \tfrac{1}{2} d(R,Q) - \tfrac{1}{2} |d(R,P) - d(R,Q)| \\
&= \tfrac{1}{2} d_{13} + \tfrac{1}{2} d_{23} - \tfrac{1}{2} |d_{13} - d_{23}| \\
&= \tfrac{50}{2} + \tfrac{41}{2} - \tfrac{1}{2} |50 - 41| \\
&= 41.
\end{aligned} \tag{11.10}$$

The reduced distance matrix is then $\begin{pmatrix} 0 & 41 \\ 41 & 0 \end{pmatrix}$. The next and last step is to unite the clusters 
$R$ and $P + Q$ into a single cluster $\mathcal{X}$, the original data matrix. 
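

The update formula (11.9) with the weights of Table 11.4 can be written as a single function. 
The Python sketch below is an illustration only; the function names and the method labels 
(such as `"single"` or `"centroid"`) are assumptions introduced here. It reproduces the value 
41 of Example 11.4. 

```python
def weights(method, n_p, n_q, n_r):
    """Weighting factors (delta_1, ..., delta_4) as listed in Table 11.4."""
    if method == "single":
        return (0.5, 0.5, 0.0, -0.5)
    if method == "complete":
        return (0.5, 0.5, 0.0, 0.5)
    if method == "average_unweighted":
        return (0.5, 0.5, 0.0, 0.0)
    if method == "median":
        return (0.5, 0.5, -0.25, 0.0)
    s = n_p + n_q
    if method == "average_weighted":
        return (n_p / s, n_q / s, 0.0, 0.0)
    if method == "centroid":
        return (n_p / s, n_q / s, -n_p * n_q / s ** 2, 0.0)
    if method == "ward":
        t = n_r + s
        return ((n_r + n_p) / t, (n_r + n_q) / t, -n_r / t, 0.0)
    raise ValueError(f"unknown method: {method}")


def updated_distance(d_rp, d_rq, d_pq, n_p, n_q, n_r, method):
    """Distance between group R and the merged group P + Q, following (11.9)."""
    w1, w2, w3, w4 = weights(method, n_p, n_q, n_r)
    return w1 * d_rp + w2 * d_rq + w3 * d_pq + w4 * abs(d_rp - d_rq)


# Example 11.4: d(R,P) = 50, d(R,Q) = 41, d(P,Q) = 1, all groups of size one.
for method in ("single", "complete", "centroid"):
    print(method, updated_distance(50, 41, 1, 1, 1, 1, method))
```

For the centroid weights the result, 45.25, equals the squared Euclidean distance between 
$x_3 = (5,5)$ and the midpoint $(0.5, 0)$ of $x_1$ and $x_2$, which gives a quick geometric check of 
the table entry. 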


When there are more data points than in the example above, a visualization of the implication 
of clusters is desirable. A graphical representation of the sequence of clustering is called a 
dendrogram. It displays the observations, the sequence of clusters and the distances between 
the clusters. The vertical axis displays the indices of the points, whereas the horizontal 
axis gives the distance between the clusters. Large distances indicate the clustering of 
heterogeneous groups. Thus, if we choose to “cut the tree” at a desired level, the branches 
describe the corresponding clusters. 


EXAMPLE 11.5 Here we describe the single linkage algorithm for the eight data points 
displayed in Figure 11.1. The distance matrix (squared $L_2$-norms) is 

$$\mathcal{D} = \begin{pmatrix}
0 & 10 & 53 & 73 & 50 & 98 & 41 & 65 \\
 & 0 & 25 & 41 & 20 & 80 & 37 & 65 \\
 & & 0 & 2 & 1 & 25 & 18 & 34 \\
 & & & 0 & 5 & 17 & 20 & 32 \\
 & & & & 0 & 36 & 25 & 45 \\
 & & & & & 0 & 13 & 9 \\
 & & & & & & 0 & 4 \\
 & & & & & & & 0
\end{pmatrix}$$

and the dendrogram is shown in Figure 11.2. 


If we decide to cut the tree at the level 10, three clusters are defined: {1,2}, {3,4,5} and 
{6,7,8}. 
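

The agglomerative steps of Example 11.5 can be traced with a short program. The following 
Python sketch is a naive illustration, not the quantlet used for the figures: the function 
name `agglomerate` and its `cut` parameter are assumptions, and the between-cluster distance 
is recomputed from the original matrix as the minimum over all pairs, which for single 
linkage gives the same result as the recursive update (11.9). 

```python
# Upper triangle of the distance matrix of Example 11.5 (row i starts at column i).
upper = [
    [0, 10, 53, 73, 50, 98, 41, 65],
    [0, 25, 41, 20, 80, 37, 65],
    [0, 2, 1, 25, 18, 34],
    [0, 5, 17, 20, 32],
    [0, 36, 25, 45],
    [0, 13, 9],
    [0, 4],
    [0],
]
n = len(upper)
D = [[0] * n for _ in range(n)]
for i, row in enumerate(upper):
    for offset, value in enumerate(row):
        D[i][i + offset] = D[i + offset][i] = value


def agglomerate(dist, linkage=min, cut=float("inf")):
    """Merge the two closest clusters until one is left or the distance exceeds cut."""
    clusters = [[i] for i in range(len(dist))]
    merges = []

    def between(a, b):
        # linkage=min gives single linkage, linkage=max gives complete linkage.
        return linkage(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        d, ia, ib = min(
            (between(a, b), ia, ib)
            for ia, a in enumerate(clusters)
            for ib, b in enumerate(clusters)
            if ia < ib
        )
        if d > cut:
            break
        merges.append((d, clusters[ia], clusters[ib]))
        merged = clusters[ia] + clusters[ib]
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)] + [merged]
    return clusters, merges


clusters, merges = agglomerate(D, linkage=min, cut=10)
labels = sorted(sorted(i + 1 for i in c) for c in clusters)  # 1-based point indices
print(labels)
print([d for d, _, _ in merges])
```

The list of merge distances corresponds to the heights at which branches join in the 
dendrogram; stopping at the level 10 leaves the three clusters named in the example. 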
Figure 11.1. The 8-point example (axes: first coordinate, second coordinate). 
MVAclus8p.xpl 


Figure 11.2. The dendrogram for the 8-point example, Single linkage 
algorithm (panel title: Single Linkage Dendrogram - 8 points; distance axis: squared 
Euclidean distance). MVAclus8p.xpl 


The single linkage algorithm defines the distance between two groups as the smallest value 
of the individual distances. Table 11.4 shows that in this case 

$$d(R, P+Q) = \min\{d(R,P), d(R,Q)\}. \tag{11.11}$$

This algorithm is also called the Nearest Neighbor algorithm. As a consequence of its construction, 
single linkage tends to build large groups. Groups that differ but are not well separated 
may thus be classified into one group as long as they have two points that are close to each other. 
The complete linkage algorithm tries to correct this kind of grouping by considering the 
largest (individual) distances. Indeed, the complete linkage distance can be written as 

$$d(R, P+Q) = \max\{d(R,P), d(R,Q)\}. \tag{11.12}$$

It is also called the Farthest Neighbor algorithm. This algorithm will cluster groups where 
all the points are proximate, since it compares the largest distances. The average linkage 
algorithm (weighted or unweighted) proposes a compromise between the two preceding 
algorithms, in that it computes an average distance: 

$$d(R, P+Q) = \frac{n_P}{n_P + n_Q} d(R,P) + \frac{n_Q}{n_P + n_Q} d(R,Q). \tag{11.13}$$

The centroid algorithm is quite similar to the average linkage algorithm and uses the natural 
geometrical distance between $R$ and the weighted center of gravity of $P$ and $Q$ (see 
Figure 11.3): 

$$d(R, P+Q) = \frac{n_P}{n_P + n_Q} d(R,P) + \frac{n_Q}{n_P + n_Q} d(R,Q) - \frac{n_P n_Q}{(n_P + n_Q)^2} d(P,Q). \tag{11.14}$$


The Ward clustering algorithm computes the distance between groups according to the formula 
in Table 11.4. The main difference between this algorithm and the linkage procedures is 
in the unification procedure. The Ward algorithm does not put together groups with smallest 
distance. Instead, it joins groups that do not increase a given measure of heterogeneity 
“too much”. The aim of the Ward procedure is to unify groups such that the variation inside 
these groups does not increase too drastically: the resulting groups are as homogeneous as 
possible. 


Figure 11.3. The centroid algorithm (groups $P$, $Q$ and $R$, with the weighted center of 
gravity of $P + Q$). 


The heterogeneity of group $R$ is measured by the inertia inside the group. This inertia is 
defined as follows: 

$$I_R = \frac{1}{n_R} \sum_{i=1}^{n_R} d^2(x_i, \bar{x}_R) \tag{11.15}$$

where $\bar{x}_R$ is the center of gravity (mean) over the group. $I_R$ clearly provides a scalar measure 
of the dispersion of the group around its center of gravity. If the usual Euclidean distance is 
used, then $I_R$ represents the sum of the variances of the $p$ components of $x_i$ inside group $R$. 


When two objects or groups $P$ and $Q$ are joined, the new group $P + Q$ has a larger inertia 
$I_{P+Q}$. It can be shown that the corresponding increase of inertia is given by 

$$\Delta(P, Q) = \frac{n_P n_Q}{n_P + n_Q} d^2(P, Q). \tag{11.16}$$

In this case, the Ward algorithm is defined as an algorithm that “joins the groups that give 
the smallest increase in $\Delta(P,Q)$”. It is easy to prove that when $P$ and $Q$ are joined, the new 
criterion values are given by (11.9) along with the values of $\delta_i$ given in Table 11.4, when the 
centroid formula is used to modify $d^2(R, P+Q)$. So, the Ward algorithm is related to the 
centroid algorithm, but with an “inertial” distance $\Delta$ rather than the “geometric” distance 
$d^2$. 
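

The increase formula (11.16) can be checked numerically. The Python sketch below makes 
two assumptions that go beyond the text: heterogeneity is measured by the unnormalized 
sum of squared Euclidean distances to the group mean (that is, without the factor $1/n_R$ 
of (11.15)), and $d^2(P,Q)$ is the squared Euclidean distance between the two group means. 
The points and function names are invented for the illustration. 

```python
def mean(points):
    """Center of gravity of a list of points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))


def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def within_ss(points):
    """Sum of squared distances of the points to their center of gravity."""
    center = mean(points)
    return sum(sq_dist(p, center) for p in points)


def ward_increase(P, Q):
    """Right-hand side of (11.16) with d^2 taken between the group means."""
    n_p, n_q = len(P), len(Q)
    return n_p * n_q / (n_p + n_q) * sq_dist(mean(P), mean(Q))


# Hypothetical groups in the plane.
P = [(0, 0), (1, 0)]
Q = [(5, 5), (6, 4), (7, 6)]

observed = within_ss(P + Q) - within_ss(P) - within_ss(Q)
print(observed, ward_increase(P, Q))
```

Under these assumptions both numbers agree, which shows why the Ward criterion can be 
evaluated from group sizes and centers alone, without revisiting the individual points. 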


As pointed out in Section 11.2, all the algorithms above can be adjusted by the choice of 
the metric $A$ defining the geometric distance $d^2$. If the results of a clustering algorithm 
are illustrated as graphical representations of individuals in spaces of low dimension (using 
principal components (normalized or not) or using a correspondence analysis for contingency 
tables), it is important to be coherent in the choice of the metric used. 
Figure 11.4. PCA for 20 randomly chosen bank notes (panel title: 20 Swiss bank notes; 
axes: first PC, second PC). MVAclusbank.xpl 


EXAMPLE 11.6 As an example we randomly select 20 observations from the bank notes 
data and apply the Ward technique using Euclidean distances. Figure 11.4 shows the first 
two PCs of these data, Figure 11.5 displays the dendrogram. 


EXAMPLE 11.7 Consider the French food expenditures. As in Chapter 9 we use standardized 
data which is equivalent to using $A = \mathrm{diag}(s_{X_1X_1}^{-1}, \ldots, s_{X_7X_7}^{-1})$ as the weight matrix in 
the $L_2$-norm. The NPCA plot of the individuals was given in Figure 9.7. The Euclidean 
distance matrix is of course given by (11.6). The dendrogram obtained by using the Ward 
algorithm is shown in Figure 11.6. 


If the aim was to have only two groups, as can be seen in Figure 11.6, they would be {CA2, 
CA3, CA4, CA5, EM5} and {MA2, MA3, MA4, MA5, EM2, EM3, EM4}. Clustering three 
groups is somewhat arbitrary (the levels of the distances are too similar). If we were interested 
in four groups, we would obtain {CA2, CA3, CA4}, {EM2, MA2, EM3, MA3}, {EM4, MA4, 
MA5} and {EM5, CA5}. This grouping shows a balance between socio-professional levels and 
size of the families in determining the clusters. The four groups are clearly well represented 
in the NPCA plot in Figure 9.7. 


Figure 11.5. The dendrogram for the 20 bank notes, Ward algorithm (panel title: 
Dendrogram for 20 Swiss bank notes). MVAclusbank.xpl 


Summary 


↪ The class of clustering algorithms can be divided into two types: hierarchical 
and partitioning algorithms. Hierarchical algorithms start with the 
finest (coarsest) possible partition and put groups together (split groups 
apart) step by step. Partitioning algorithms start from a preliminary clustering 
and exchange group elements until a certain score is reached. 


↪ Hierarchical agglomerative techniques are frequently used in practice. 
They start from the finest possible structure (each data point forms a 
cluster), compute the distance matrix for the clusters and join the clusters 
that have the smallest distance. This step is repeated until all points 
are united in one cluster. 


↪ The agglomerative procedure depends on the definition of the distance 
between two clusters. Single linkage, complete linkage, and Ward distance 
are frequently used distances. 


↪ The process of the unification of clusters can be graphically represented 
by a dendrogram. 


11.4 Boston Housing 


We have motivated the transformation of the variables of the Boston housing data many 
times before. Now we illustrate the cluster algorithm with the transformed data 4 excluding 
X,4 (Charles River indicator). Among the various algorithms, the results from the Ward 
algorithm are presented since this algorithm gave the most sensible results. In order to be 


Variable   Mean C1   SE C1    Mean C2   SE C2
    1      -0.7105   0.0332    0.6994   0.0535
    2       0.4848   0.0786   -0.4772   0.0047
    3      -0.7665   0.0510    0.7545   0.0279
    5      -0.7672   0.0365    0.7552   0.0447
    6       0.4162   0.0571   -0.4097   0.0576
    7      -0.7730   0.0429    0.7609   0.0378
    8       0.7140   0.0472   -0.7028   0.0417
    9      -0.5429   0.0358    0.5344   0.0656
   10      -0.6932   0.0301    0.6823   0.0569
   11      -0.5464   0.0469    0.5378   0.0582
   12       0.3547   0.0080   -0.3491   0.0824
   13      -0.6899   0.0401    0.6791   0.0509
   14       0.5996   0.0431   -0.5902   0.0570


Table 11.6. Means and standard errors of the 13 standardized variables
for Cluster 1 (251 observations) and Cluster 2 (255 observations).
MVAclusbh.xpl
Figure 11.6. The dendrogram for the French food expenditures, Ward
algorithm. MVAclusfood.xpl


coherent with our previous analysis, we standardize each variable. The dendrogram of the 
Ward method is displayed in Figure 11.7. Two dominant clusters are visible. A further 
refinement of say, 4 clusters, could be considered at a lower level of distance. 


To interpret the two clusters, we present the mean values and their respective standard
errors of the thirteen $\widetilde{X}$ variables by group in Table 11.6. Comparing the mean values for
both groups shows that all the differences in the means are individually significant and that 
cluster one corresponds to housing districts with better living quality and higher house prices, 
whereas cluster two corresponds to less favored districts in Boston. This can be confirmed, 
for instance, by a lower crime rate, a higher proportion of residential land, lower proportion 
of blacks, etc. for cluster one. Cluster two is identified by a higher proportion of older 
houses, a higher pupil/teacher ratio and a higher percentage of the lower status population. 


This interpretation is underlined by visual inspection of all the variables presented on scatterplot
matrices in Figures 11.8 and 11.9. For example, the lower right boxplot of Figure 11.9
and the correspondingly colored clusters in the last row confirm the role of each variable in
determining the clusters. This interpretation perfectly coincides with the previous PC analysis
(Figure 9.11). The quality of life factor is clearly visible in Figure 11.10, where cluster
membership is distinguished by the shape and color of the points graphed according to the
first two principal components. Clearly, the first PC completely separates the two clusters
and corresponds, as we have discussed in Chapter 9, to a quality of life and house indicator.


Figure 11.7. Dendrograms of the Boston housing data using the Ward
algorithm. MVAclusbh.xpl


11.5 Exercises 


EXERCISE 11.1 Prove formula (11.16). 


EXERCISE 11.2 Prove that $I_R = \operatorname{tr}(\mathcal{S}_R)$, where $\mathcal{S}_R$ denotes the empirical covariance matrix
of the observations contained in $R$.


Figure 11.8. Scatterplot matrix for variables $\widetilde{X}_1$ to $\widetilde{X}_7$ of the Boston
housing data. MVAclusbh.xpl


EXERCISE 11.3 Prove that


$$\Delta(R, P+Q) = \frac{n_R + n_P}{n_R + n_P + n_Q}\,\Delta(R, P) + \frac{n_R + n_Q}{n_R + n_P + n_Q}\,\Delta(R, Q) - \frac{n_R}{n_R + n_P + n_Q}\,\Delta(P, Q),$$


when the centroid formula is used to define $d^2(R, P+Q)$.


Figure 11.9. Scatterplot matrix for variables $\widetilde{X}_8$ to $\widetilde{X}_{14}$ of the Boston
housing data. MVAclusbh.xpl


EXERCISE 11.4 Repeat the 8-point example (Example 11.5) using the complete linkage and 
the Ward algorithm. Explain the difference to single linkage. 


EXERCISE 11.5 Explain the differences between various proximity measures by means of 
an example. 
Figure 11.10. Scatterplot of the first two PCs (first vs. second PC) displaying the two clusters.
MVAclusbh.xpl


EXERCISE 11.6 Repeat the bank notes example (Example 11.6) with another random sam- 
ple of 20 notes. 


EXERCISE 11.7 Repeat the bank notes example (Example 11.6) with another clustering 
algorithm. 


EXERCISE 11.8 Repeat the bank notes example (Example 11.6) or the 8-point example 
(Example 11.5) with the Ly-norm. 


EXERCISE 11.9 Analyze the U.S. companies example (Table B.5) using the Ward algorithm 
and the L2-norm. 


EXERCISE 11.10 Analyze the U.S. crime data set (Table B.10) with the Ward algorithm 
and the Ly-norm on standardized variables (use only the crime variables). 


EXERCISE 11.11 Repeat Exercise 11.10 with the U.S. health data set (use only the number
of deaths variables).


EXERCISE 11.12 Redo Exercise 11.10 with the $\chi^2$-metric. Compare the results.


EXERCISE 11.13 Redo Exercise 11.11 with the $\chi^2$-metric and compare the results.


12 Discriminant Analysis 


Discriminant analysis is used in situations where the clusters are known a priori. The aim 
of discriminant analysis is to classify an observation, or several observations, into these 
known groups. For instance, in credit scoring, a bank knows from past experience that there 
are good customers (who repay their loan without any problems) and bad customers (who 
showed difficulties in repaying their loan). When a new customer asks for a loan, the bank 
has to decide whether or not to give the loan. The past records of the bank provide two
data sets: multivariate observations $x_i$ on the two categories of customers (including for
example age, salary, marital status, the amount of the loan, etc.). The new customer is 
a new observation x with the same variables. The discrimination rule has to classify the 
customer into one of the two existing groups and the discriminant analysis should evaluate 
the risk of a possible “bad decision”. 


Many other examples are described below, and in most applications, the groups correspond 
to natural classifications or to groups known from history (like in the credit scoring example). 
These groups could have been formed by a cluster analysis performed on past data. 


Section 12.1 presents the allocation rules when the populations are known, i.e., when we know 
the distribution of each population. As described in Section 12.2, in practice the population
characteristics have to be estimated from history. The methods are illustrated in several 
examples. 


12.1 Allocation Rules for Known Distributions 


Discriminant analysis is a set of methods and tools used to distinguish between groups of
populations $\Pi_j$ and to determine how to allocate new observations into groups. In one of
our running examples we are interested in discriminating between counterfeit and true bank 
notes on the basis of measurements of these bank notes, see Table B.2. In this case we have 
two groups (counterfeit and genuine bank notes) and we would like to establish an algorithm 
(rule) that can allocate a new observation (a new bank note) into one of the groups. 


Another example is the detection of “fast” and “slow” consumers of a newly introduced 
product. Using a consumer’s characteristics like education, income, family size, amount 
of previous brand switching, we want to classify each consumer into the two groups just
identified. 


In poetry and literary studies the frequencies of spoken or written words and lengths of 
sentences indicate profiles of different artists and writers. It can be of interest to attribute 
unknown literary or artistic works to certain writers with a specific profile. Anthropological 
measures on ancient skulls help in discriminating between male and female bodies. Good
and poor credit risk ratings constitute a discrimination problem that might be tackled using
observations on income, age, number of credit cards, family size, etc.


In general we have populations $\Pi_j$, $j = 1, 2, \ldots, J$, and we have to allocate an observation $x$
to one of these groups. A discriminant rule is a separation of the sample space (in general
$\mathbb{R}^p$) into sets $R_j$ such that if $x \in R_j$, it is identified as a member of population $\Pi_j$.


The main task of discriminant analysis is to find “good” regions $R_j$ such that the error
of misclassification is small. In the following we describe such rules when the population
distributions are known.


Maximum Likelihood Discriminant Rule 


Denote the densities of each population $\Pi_j$ by $f_j(x)$. The maximum likelihood discriminant
rule (ML rule) is given by allocating $x$ to $\Pi_j$ maximizing the likelihood $L_j(x) = f_j(x) =
\max_i f_i(x)$.

If several $f_i$ give the same maximum then any of them may be selected. Mathematically,
the sets $R_j$ given by the ML discriminant rule are defined as


$$R_j = \{x : L_j(x) > L_i(x) \text{ for } i = 1, \ldots, J,\ i \neq j\}. \tag{12.1}$$


By classifying the observation into a certain group we may encounter a misclassification error.
For $J = 2$ groups the probability of putting $x$ into group 2 although it is from population 1
can be calculated as


$$p_{21} = P(X \in R_2 \mid \Pi_1) = \int_{R_2} f_1(x)\,dx. \tag{12.2}$$


Similarly the conditional probability of classifying an object as belonging to the first population
$\Pi_1$ although it actually comes from $\Pi_2$ is


$$p_{12} = P(X \in R_1 \mid \Pi_2) = \int_{R_1} f_2(x)\,dx. \tag{12.3}$$


The misclassified observations create a cost $C(i|j)$ when a $\Pi_j$ observation is assigned to $R_i$.
In the credit risk example, this might be the cost of a “sour” credit. The cost structure can
be pinned down in a cost matrix:


                                Classified population
                                $\Pi_1$       $\Pi_2$
True population     $\Pi_1$     0             $C(2|1)$
                    $\Pi_2$     $C(1|2)$      0

Let $\pi_j$ be the prior probability of population $\Pi_j$, where “prior” means the a priori probability
that an individual selected at random belongs to $\Pi_j$ (i.e., before looking at the value $x$). Prior
probabilities should be considered if it is clear ahead of time that an observation is more
likely to stem from a certain population $\Pi_j$. An example is the classification of musical tunes.
If it is known that during a certain period of time a majority of tunes were written by a 
certain composer, then there is a higher probability that a certain tune was composed by this 
composer. Therefore, he should receive a higher prior probability when tunes are assigned 
to a specific group. 


The expected cost of misclassification (ECM) is given by

$$ECM = C(2|1)\,p_{21}\,\pi_1 + C(1|2)\,p_{12}\,\pi_2. \tag{12.4}$$

We will be interested in classification rules that keep the ECM small or minimize it over
a class of rules. The discriminant rule minimizing the ECM (12.4) for two populations is
given below.


THEOREM 12.1 For two given populations, the rule minimizing the ECM is given by

$$R_1 = \left\{ x : \frac{f_1(x)}{f_2(x)} \ge \left(\frac{C(1|2)}{C(2|1)}\right)\left(\frac{\pi_2}{\pi_1}\right) \right\},$$

$$R_2 = \left\{ x : \frac{f_1(x)}{f_2(x)} < \left(\frac{C(1|2)}{C(2|1)}\right)\left(\frac{\pi_2}{\pi_1}\right) \right\}.$$

The ML discriminant rule is thus a special case of the ECM rule for equal misclassification
costs and equal prior probabilities. For simplicity the unity cost case, $C(1|2) = C(2|1) = 1$,
and equal prior probabilities, $\pi_2 = \pi_1$, are assumed in the following.
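

The rule of Theorem 12.1 can be sketched as a small function. This is an illustration under assumptions rather than part of the original text: the two normal populations, the numerical costs and the function names are hypothetical, and the ratio condition is rewritten as a product so that a zero density causes no division error.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def ecm_allocate(x, f1, f2, cost_12=1.0, cost_21=1.0, prior_1=0.5, prior_2=0.5):
    """Return 1 if x falls in R_1 of Theorem 12.1, else 2.

    cost_12 stands for C(1|2) and cost_21 for C(2|1). The condition
    f1/f2 >= (C(1|2)/C(2|1)) * (pi_2/pi_1) is checked in product form.
    """
    return 1 if f1(x) * cost_21 * prior_1 >= f2(x) * cost_12 * prior_2 else 2

# Hypothetical populations: N(0, 1) and N(2, 1).
f1 = lambda x: normal_pdf(x, 0.0, 1.0)
f2 = lambda x: normal_pdf(x, 2.0, 1.0)

# Unit costs and equal priors reduce to the ML rule (cut point at x = 1).
ml_choice = ecm_allocate(0.9, f1, f2)
# Making C(1|2) ten times larger shrinks R_1, so the same x is moved to Pi_2.
costly_choice = ecm_allocate(0.9, f1, f2, cost_12=10.0)
```

Raising the prior of one population has the opposite effect to raising the cost of misclassifying into it: the region of that population grows.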


Theorem 12.1 will be proven by an example from credit scoring. 


EXAMPLE 12.1 Suppose that $\Pi_1$ represents the population of bad clients who create the
cost $C(2|1)$ if they are classified as good clients. Analogously, define $C(1|2)$ as the cost of
losing a good client classified as a bad one. Let $\gamma$ denote the gain of the bank for the correct
classification of a good client. The total gain of the bank is then

$$G(R_2) = -C(2|1)\,\pi_1 \int I(x \in R_2) f_1(x)\,dx - C(1|2)\,\pi_2 \int \{1 - I(x \in R_2)\} f_2(x)\,dx + \gamma\,\pi_2 \int I(x \in R_2) f_2(x)\,dx$$

$$= -C(1|2)\,\pi_2 + \int I(x \in R_2)\{-C(2|1)\,\pi_1 f_1(x) + (C(1|2) + \gamma)\,\pi_2 f_2(x)\}\,dx.$$

Since the first term in this equation is constant, the maximum is obviously obtained for

$$R_2 = \{x : -C(2|1)\,\pi_1 f_1(x) + \{C(1|2) + \gamma\}\,\pi_2 f_2(x) \ge 0\}.$$

This is equivalent to

$$R_2 = \left\{ x : \frac{f_2(x)}{f_1(x)} \ge \frac{C(2|1)\,\pi_1}{\{C(1|2) + \gamma\}\,\pi_2} \right\},$$

which corresponds to the set $R_2$ in Theorem 12.1 for a gain of $\gamma = 0$.


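EXAMPLE 12.2 Suppose $x \in \{0, 1\}$ and


$\Pi_1 : P(X = 0) = P(X = 1) = \frac{1}{2}$
$\Pi_2 : P(X = 0) = \frac{1}{4} = 1 - P(X = 1)$.


The sample space is the set $\{0, 1\}$. The ML discriminant rule is to allocate $x = 0$ to $\Pi_1$ and
$x = 1$ to $\Pi_2$, defining the sets $R_1 = \{0\}$, $R_2 = \{1\}$ and $R_1 \cup R_2 = \{0, 1\}$.


EXAMPLE 12.3 Consider two normal populations


$\Pi_1 : N(\mu_1, \sigma_1^2)$,
$\Pi_2 : N(\mu_2, \sigma_2^2)$.


Then

$$L_i(x) = (2\pi\sigma_i^2)^{-1/2} \exp\left\{ -\frac{1}{2}\left(\frac{x - \mu_i}{\sigma_i}\right)^2 \right\}.$$


Hence $x$ is allocated to $\Pi_1$ ($x \in R_1$) if $L_1(x) \ge L_2(x)$. Note that $L_1(x) \ge L_2(x)$ is equivalent
to

$$\frac{\sigma_2}{\sigma_1} \exp\left[ -\frac{1}{2}\left\{ \left(\frac{x - \mu_1}{\sigma_1}\right)^2 - \left(\frac{x - \mu_2}{\sigma_2}\right)^2 \right\} \right] \ge 1$$

or

$$x^2\left(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}\right) - 2x\left(\frac{\mu_1}{\sigma_1^2} - \frac{\mu_2}{\sigma_2^2}\right) + \left(\frac{\mu_1^2}{\sigma_1^2} - \frac{\mu_2^2}{\sigma_2^2}\right) \le 2\log\frac{\sigma_2}{\sigma_1}. \tag{12.5}$$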
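

Figure 12.1. Maximum likelihood rule for normal distributions (densities of the two normal distributions).
MVAdisnorm.xpl


Suppose that $\mu_1 = 0$, $\sigma_1 = 1$ and $\mu_2 = 1$, $\sigma_2 = \frac{1}{2}$. Formula (12.5) leads to

$$R_1 = \left\{ x : x \le \tfrac{1}{3}\left(4 - \sqrt{4 + 6\log 2}\right) \text{ or } x \ge \tfrac{1}{3}\left(4 + \sqrt{4 + 6\log 2}\right) \right\},$$

$$R_2 = \mathbb{R} \setminus R_1.$$

This situation is shown in Figure 12.1.
The situation simplifies in the case of equal variances $\sigma_1 = \sigma_2$. The discriminant rule (12.5)
is then (for $\mu_1 < \mu_2$)

$$\begin{aligned} x \to \Pi_1, &\quad \text{if } x \in R_1 = \{x : x \le \tfrac{1}{2}(\mu_1 + \mu_2)\}, \\ x \to \Pi_2, &\quad \text{if } x \in R_2 = \{x : x > \tfrac{1}{2}(\mu_1 + \mu_2)\}. \end{aligned} \tag{12.6}$$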
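

As a numerical check of the boundaries of $R_1$ in Example 12.3, the quadratic in (12.5) can be solved directly. The sketch below assumes unequal standard deviations (so the $x^2$ coefficient is nonzero) and two real roots; the function names are illustrative only.

```python
import math

def ml_region_bounds(mu1, sigma1, mu2, sigma2):
    """Roots of the quadratic in (12.5), i.e. the points where L_1(x) = L_2(x).

    Assumes sigma1 != sigma2 and a nonnegative discriminant.
    """
    a = 1.0 / sigma1**2 - 1.0 / sigma2**2
    b = -2.0 * (mu1 / sigma1**2 - mu2 / sigma2**2)
    c = mu1**2 / sigma1**2 - mu2**2 / sigma2**2 - 2.0 * math.log(sigma2 / sigma1)
    disc = math.sqrt(b * b - 4.0 * a * c)
    return tuple(sorted(((-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a))))

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Parameters of Example 12.3: mu1 = 0, sigma1 = 1, mu2 = 1, sigma2 = 1/2.
lower, upper = ml_region_bounds(0.0, 1.0, 1.0, 0.5)
```

For these parameters the two roots are approximately 0.38 and 2.29, in agreement with the closed form $\frac{1}{3}(4 \mp \sqrt{4 + 6\log 2})$; between them the narrower density of $\Pi_2$ dominates.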


Theorem 12.2 shows that the ML discriminant rule for multinormal observations is intimately
connected with the Mahalanobis distance. The discriminant rule is based on linear
combinations and belongs to the family of Linear Discriminant Analysis (LDA) methods.


THEOREM 12.2 Suppose $\Pi_i = N_p(\mu_i, \Sigma)$.


(a) The ML rule allocates $x$ to $\Pi_j$, where $j \in \{1, \ldots, J\}$ is the value minimizing the square
Mahalanobis distance between $x$ and $\mu_i$:

$$\delta^2(x, \mu_i) = (x - \mu_i)^\top \Sigma^{-1} (x - \mu_i), \quad i = 1, \ldots, J.$$


(b) In the case of $J = 2$,

$$x \in R_1 \iff \alpha^\top (x - \mu) \ge 0,$$

where $\alpha = \Sigma^{-1}(\mu_1 - \mu_2)$ and $\mu = \frac{1}{2}(\mu_1 + \mu_2)$.


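Proof:
Part (a) of the Theorem follows directly from comparison of the likelihoods.


For $J = 2$, part (a) says that $x$ is allocated to $\Pi_1$ if

$$(x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \le (x - \mu_2)^\top \Sigma^{-1} (x - \mu_2).$$

Rearranging terms leads to

$$-2\mu_1^\top \Sigma^{-1} x + 2\mu_2^\top \Sigma^{-1} x + \mu_1^\top \Sigma^{-1} \mu_1 - \mu_2^\top \Sigma^{-1} \mu_2 \le 0,$$

which is equivalent to

$$2(\mu_2 - \mu_1)^\top \Sigma^{-1} x + (\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 + \mu_2) \le 0,$$

$$(\mu_1 - \mu_2)^\top \Sigma^{-1} \left\{ x - \tfrac{1}{2}(\mu_1 + \mu_2) \right\} \ge 0,$$

$$\alpha^\top (x - \mu) \ge 0.$$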
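

The equivalence used in the proof can be checked numerically: the difference of the two squared Mahalanobis distances equals twice the linear score $\alpha^\top(x - \mu)$. The sketch below works in two dimensions with plain Python lists; the means, the covariance matrix and the point $x$ are hypothetical values chosen only for illustration.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    """Matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def dot(u, v):
    """Inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def mahalanobis_sq(x, mu, sigma_inv):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    return dot(diff, matvec(sigma_inv, diff))

def linear_score(x, mu1, mu2, sigma_inv):
    """alpha^T (x - mu) with alpha = Sigma^{-1}(mu1 - mu2), mu = (mu1 + mu2)/2."""
    alpha = matvec(sigma_inv, [a - b for a, b in zip(mu1, mu2)])
    mid = [(a + b) / 2.0 for a, b in zip(mu1, mu2)]
    return dot(alpha, [xi - mi for xi, mi in zip(x, mid)])

# Hypothetical parameters.
mu1, mu2 = [0.0, 0.0], [2.0, 1.0]
sigma_inv = inv2([[1.0, 0.5], [0.5, 2.0]])
x = [0.5, 1.0]

score = linear_score(x, mu1, mu2, sigma_inv)
gap = mahalanobis_sq(x, mu2, sigma_inv) - mahalanobis_sq(x, mu1, sigma_inv)
```

Here `score` is positive, so $x$ is allocated to $\Pi_1$, and `gap` equals `2 * score`, which is the identity behind part (b).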


Bayes Discriminant Rule 


We have seen an example where prior knowledge on the probability of classification into
$\Pi_j$ was assumed. Denote the prior probabilities by $\pi_j$ and note that $\sum_{j=1}^{J} \pi_j = 1$. The
Bayes rule of discrimination allocates $x$ to the $\Pi_j$ that gives the largest value of $\pi_i f_i(x)$,
$\pi_j f_j(x) = \max_i \pi_i f_i(x)$. Hence, the discriminant rule is defined by $R_j = \{x : \pi_j f_j(x) \ge
\pi_i f_i(x) \text{ for } i = 1, \ldots, J\}$. Obviously the Bayes rule is identical to the ML discriminant rule
for $\pi_j = 1/J$.
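

A minimal sketch of the Bayes rule, applied to two discrete populations on $\{0, 1\}$ with $P(X=0) = 1/2$ under $\Pi_1$ and $P(X=0) = 1/4$ under $\Pi_2$ (the setting of Example 12.2). The function name, the tie-breaking choice and the unequal priors in the second call are assumptions made for illustration.

```python
def bayes_allocate(x, densities, priors):
    """Return the 1-based index j maximizing priors[j] * densities[j](x).

    Ties are resolved in favor of the smallest index, an arbitrary choice.
    """
    scores = [p * f(x) for f, p in zip(densities, priors)]
    return scores.index(max(scores)) + 1

# The two discrete populations described above.
f1 = lambda x: {0: 0.5, 1: 0.5}[x]
f2 = lambda x: {0: 0.25, 1: 0.75}[x]

# Equal priors reproduce the ML rule: R_1 = {0}, R_2 = {1}.
ml_rule = [bayes_allocate(x, [f1, f2], [0.5, 0.5]) for x in (0, 1)]
# A hypothetical prior of 0.8 on Pi_1 moves x = 1 into R_1 as well.
skewed_rule = [bayes_allocate(x, [f1, f2], [0.8, 0.2]) for x in (0, 1)]
```

With the skewed prior both points are allocated to $\Pi_1$, which shows how strongly the prior can override the likelihood when the densities are not well separated.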


A further modification is to allocate $x$ to $\Pi_j$ with a certain probability $\phi_j(x)$, such that
$\sum_{j=1}^{J} \phi_j(x) = 1$ for all $x$. This is called a randomized discriminant rule. A randomized
discriminant rule is a generalization of deterministic discriminant rules since

$$\phi_j(x) = \begin{cases} 1 & \text{if } \pi_j f_j(x) = \max_i \pi_i f_i(x), \\ 0 & \text{otherwise} \end{cases}$$

reflects the deterministic rules.


Which discriminant rules are good? We need a measure of comparison. Denote

$$p_{ij} = \int \phi_i(x) f_j(x)\,dx \tag{12.7}$$

as the probability of allocating $x$ to $\Pi_i$ if it in fact belongs to $\Pi_j$. A discriminant rule with
probabilities $p_{ij}$ is as good as any other discriminant rule with probabilities $p'_{ij}$ if

$$p_{ii} \ge p'_{ii} \quad \text{for all } i = 1, \ldots, J. \tag{12.8}$$

We call the first rule better if the strict inequality in (12.8) holds for at least one $i$. A
discriminant rule is called admissible if there is no better discriminant rule.


THEOREM 12.3 All Bayes discriminant rules (including the ML rule) are admissible. 


Probability of Misclassification for the ML rule (J = 2) 


Suppose that $\Pi_i = N_p(\mu_i, \Sigma)$. In the case of two groups, it is not difficult to derive the
probabilities of misclassification for the ML discriminant rule. Consider for instance $p_{12} =
P(x \in R_1 \mid \Pi_2)$. By part (b) in Theorem 12.2 we have

$$p_{12} = P\{\alpha^\top (x - \mu) > 0 \mid \Pi_2\}.$$

If $X$ stems from $\Pi_2$, then $\alpha^\top (X - \mu) \sim N\left(-\frac{1}{2}\delta^2, \delta^2\right)$, where $\delta^2 = (\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 - \mu_2)$ is the squared
Mahalanobis distance between the two populations. We therefore obtain

$$p_{12} = \Phi\left(-\tfrac{1}{2}\delta\right).$$

Similarly, the probability of being classified into population 2 although $x$ stems from $\Pi_1$ is
equal to $p_{21} = \Phi\left(-\frac{1}{2}\delta\right)$.
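

The misclassification probability $\Phi(-\frac{1}{2}\delta)$ is easy to evaluate with the error function from Python's standard library. The sketch below is illustrative; the univariate numbers are hypothetical.

```python
import math

def std_normal_cdf(z):
    """Phi(z), the standard normal distribution function, via math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ml_misclassification_prob(delta):
    """p_12 = p_21 = Phi(-delta / 2) for two normal populations with equal covariance."""
    return std_normal_cdf(-0.5 * delta)

# Univariate illustration (hypothetical numbers): mu1 = 0, mu2 = 2, sigma = 1,
# so delta = |mu1 - mu2| / sigma = 2 and the cut point is x = 1.
p = ml_misclassification_prob(2.0)
```

For $\delta = 2$ this gives $\Phi(-1)$, about 0.159. The probability decreases as the populations move apart and equals 0.5 when $\delta = 0$, i.e., when the two populations coincide.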
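

Classification with different covariance matrices


The minimum ECM depends on the ratio of the densities $\frac{f_1(x)}{f_2(x)}$ or equivalently on the difference
$\ln\{f_1(x)\} - \ln\{f_2(x)\}$. When the covariance for both density functions differ, the
allocation rule becomes more complicated:

$$R_1 = \left\{ x : -\tfrac{1}{2} x^\top (\Sigma_1^{-1} - \Sigma_2^{-1}) x + (\mu_1^\top \Sigma_1^{-1} - \mu_2^\top \Sigma_2^{-1}) x - k \ge \ln\left[ \left(\frac{C(1|2)}{C(2|1)}\right) \left(\frac{\pi_2}{\pi_1}\right) \right] \right\},$$

$$R_2 = \left\{ x : -\tfrac{1}{2} x^\top (\Sigma_1^{-1} - \Sigma_2^{-1}) x + (\mu_1^\top \Sigma_1^{-1} - \mu_2^\top \Sigma_2^{-1}) x - k < \ln\left[ \left(\frac{C(1|2)}{C(2|1)}\right) \left(\frac{\pi_2}{\pi_1}\right) \right] \right\},$$

where $k = \frac{1}{2} \ln\left(\frac{|\Sigma_1|}{|\Sigma_2|}\right) + \frac{1}{2}(\mu_1^\top \Sigma_1^{-1} \mu_1 - \mu_2^\top \Sigma_2^{-1} \mu_2)$. The classification regions are defined by
quadratic functions. Therefore they belong to the family of Quadratic Discriminant Analysis
(QDA) methods. This quadratic classification rule coincides with the rules used when $\Sigma_1 =
\Sigma_2$, since the term $\frac{1}{2} x^\top (\Sigma_1^{-1} - \Sigma_2^{-1}) x$ disappears.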


Summary


↪ Discriminant analysis is a set of methods used to distinguish among groups
in data and to allocate new observations into the existing groups.


↪ Given that data are from populations $\Pi_j$ with densities $f_j$, $j = 1, \ldots, J$,
the maximum likelihood discriminant rule (ML rule) allocates an observation
$x$ to that population $\Pi_j$ which has the maximum likelihood
$L_j(x) = f_j(x) = \max_i f_i(x)$.


↪ Given prior probabilities $\pi_j$ for populations $\Pi_j$, Bayes discriminant rule
allocates an observation $x$ to the population $\Pi_j$ that maximizes $\pi_i f_i(x)$
with respect to $i$. All Bayes discriminant rules (incl. the ML rule) are
admissible.


↪ For the ML rule and $J = 2$ normal populations, the probabilities of misclassification
are given by $p_{12} = p_{21} = \Phi\left(-\frac{1}{2}\delta\right)$ where $\delta$ is the Mahalanobis
distance between the two populations.


↪ Classification of two normal populations with different covariance matrices
(ML rule) leads to regions defined by a quadratic function.


↪ Desirable discriminant rules have a low expected cost of misclassification
(ECM).


12.2 Discrimination Rules in Practice


The ML rule is used if the distribution of the data is known up to parameters. Suppose for
example that the data come from multivariate normal distributions $N_p(\mu_j, \Sigma)$. If we have $J$
groups with $n_j$ observations in each group, we use $\bar{x}_j$ to estimate $\mu_j$, and $\mathcal{S}_j$ to estimate $\Sigma$.
The common covariance may be estimated by

$$\mathcal{S}_u = \sum_{j=1}^{J} n_j \left( \frac{\mathcal{S}_j}{n - J} \right), \tag{12.9}$$

with $n = \sum_{j=1}^{J} n_j$. Thus the empirical version of the ML rule of Theorem 12.2 is to allocate
a new observation $x$ to $\Pi_j$ such that $j$ minimizes

$$(x - \bar{x}_i)^\top \mathcal{S}_u^{-1} (x - \bar{x}_i) \quad \text{for } i \in \{1, \ldots, J\}.$$


EXAMPLE 12.4 Let us apply this rule to the Swiss bank notes. The 20 randomly chosen
bank notes which we had clustered into two groups in Example 11.6 are used. First the
covariance $\Sigma$ is estimated by the average of the covariances of $\Pi_1$ (cluster 1) and $\Pi_2$ (cluster
2). The hyperplane $\hat{\alpha}^\top (x - \bar{x}) = 0$ which separates the two populations is given by

$$\hat{\alpha} = \mathcal{S}^{-1}(\bar{x}_1 - \bar{x}_2) = (-12.18, 20.54, -19.22, -15.55, -13.06, 21.43)^\top,$$

$$\bar{x} = \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2) = (214.79, 130.05, 129.92, 9.23, 10.48, 140.46)^\top.$$

Now let us apply the discriminant rule to the entire bank notes data set. Counting the number
of misclassifications by

$$\sum_{i=1}^{100} \mathbf{I}\{\hat{\alpha}^\top (x_i - \bar{x}) < 0\}, \qquad \sum_{i=101}^{200} \mathbf{I}\{\hat{\alpha}^\top (x_i - \bar{x}) > 0\},$$

we obtain 1 misclassified observation for the counterfeit bank notes and 0 misclassifications for
the genuine bank notes.


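When $J = 3$ groups, the allocation regions can be calculated using

$$h_{12}(x) = (\bar{x}_1 - \bar{x}_2)^\top \mathcal{S}_u^{-1} \left\{ x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2) \right\}$$

$$h_{13}(x) = (\bar{x}_1 - \bar{x}_3)^\top \mathcal{S}_u^{-1} \left\{ x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_3) \right\}$$

$$h_{23}(x) = (\bar{x}_2 - \bar{x}_3)^\top \mathcal{S}_u^{-1} \left\{ x - \tfrac{1}{2}(\bar{x}_2 + \bar{x}_3) \right\}$$

The rule is to allocate $x$ to
$\Pi_1$ if $h_{12}(x) \ge 0$ and $h_{13}(x) \ge 0$,
$\Pi_2$ if $h_{12}(x) < 0$ and $h_{23}(x) \ge 0$,
$\Pi_3$ if $h_{13}(x) < 0$ and $h_{23}(x) < 0$.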


Estimation of the probabilities of misclassifications


Misclassification probabilities are given by (12.7) and can be estimated by replacing the
unknown parameters by their corresponding estimators.


For the ML rule for two normal populations we obtain

$$\hat{p}_{12} = \hat{p}_{21} = \Phi\left(-\tfrac{1}{2}\hat{\delta}\right)$$

where $\hat{\delta}^2 = (\bar{x}_1 - \bar{x}_2)^\top \mathcal{S}_u^{-1} (\bar{x}_1 - \bar{x}_2)$ is the estimator for $\delta^2$.


The probabilities of misclassification may also be estimated by the re-substitution method.
We reclassify each original observation $x_i$, $i = 1, \ldots, n$ into $\Pi_1, \ldots, \Pi_J$ according to the
chosen rule. Then denoting the number of individuals coming from $\Pi_j$ which have been
classified into $\Pi_i$ by $n_{ij}$, we have $\hat{p}_{ij} = \frac{n_{ij}}{n_j}$, an estimator of $p_{ij}$. Clearly, this method leads
to too optimistic estimators of $p_{ij}$, but it provides a rough measure of the quality of the
discriminant rule. The matrix $(\hat{p}_{ij})$ is called the confusion matrix in Johnson and Wichern
(1998).


EXAMPLE 12.5 In the above classification problem for the Swiss bank notes (Table B.2),
we have the following confusion matrix:


                          true membership
                          genuine ($\Pi_1$)    counterfeit ($\Pi_2$)
predicted    $\Pi_1$      100                  1
             $\Pi_2$      0                    99


MVAaper.xpl


The apparent error rate (APER) is defined as the fraction of observations that are misclassified.
The APER, expressed as a percentage, is

$$APER = \left( \frac{1}{200} \right) 100\% = 0.5\%.$$
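

The APER can be read off any confusion matrix as the share of off-diagonal counts. The short sketch below uses the bank notes matrix of Example 12.5; the function name and the row/column convention (rows predicted, columns true) are assumptions of this illustration.

```python
def apparent_error_rate(confusion):
    """Fraction of misclassified observations in a square confusion matrix.

    confusion[i][j] counts observations predicted as group i whose true
    group is j, so the off-diagonal entries are the misclassifications.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return (total - correct) / total

# Confusion matrix of Example 12.5 (rows: predicted, columns: true).
bank_notes = [[100, 1],
              [0, 99]]
aper = apparent_error_rate(bank_notes)
```

One misclassified note out of 200 gives an APER of 0.005, i.e., 0.5%.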


For the calculation of the APER we use the observations twice: the first time to construct 
the classification rule and the second time to evaluate this rule. An APER of 0.5% might 
therefore be too optimistic. An approach that corrects for this bias is based on the holdout 
procedure of Lachenbruch and Mickey (1968). For two populations this procedure is as 
follows: 
1. Start with the first population $\Pi_1$. Omit one observation and develop the classification
rule based on the remaining $n_1 - 1$, $n_2$ observations.


2. Classify the “holdout” observation using the discrimination rule in Step 1.


3. Repeat steps 1 and 2 until all of the $\Pi_1$ observations are classified. Count the number
$n'_{21}$ of misclassified observations.


4. Repeat steps 1 through 3 for population $\Pi_2$. Count the number $n'_{12}$ of misclassified
observations.


Estimates of the misclassification probabilities are given by

$$\hat{p}'_{12} = \frac{n'_{12}}{n_2}$$

and

$$\hat{p}'_{21} = \frac{n'_{21}}{n_1}.$$

A more realistic estimator of the actual error rate (AER) is given by

$$\frac{n'_{12} + n'_{21}}{n_2 + n_1}. \tag{12.10}$$

Statisticians favor the AER (for its unbiasedness) over the APER. In large samples, however, 
the computational costs might counterbalance the statistical advantage. This is not a real 
problem since the two misclassification measures are asymptotically equivalent. 
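

The holdout procedure can be sketched for a univariate two-group problem, using the midpoint rule (12.6) as the classifier. The data, the function names and the choice of classifier are hypothetical; the point is only to show how the holdout estimate can differ from the re-substitution estimate.

```python
def midpoint_rule(group1, group2):
    """Univariate rule (12.6): the cut point is the midpoint of the two group means."""
    m1 = sum(group1) / len(group1)
    m2 = sum(group2) / len(group2)
    cut = 0.5 * (m1 + m2)
    low, high = (1, 2) if m1 < m2 else (2, 1)
    return lambda x: low if x <= cut else high

def holdout_error_rate(group1, group2, fit=midpoint_rule):
    """Holdout estimate (n'_12 + n'_21) / (n_1 + n_2) as in (12.10)."""
    errors = 0
    for i, x in enumerate(group1):          # steps 1 to 3: hold out each Pi_1 observation
        rule = fit(group1[:i] + group1[i + 1:], group2)
        errors += rule(x) != 1
    for i, x in enumerate(group2):          # step 4: the same for Pi_2
        rule = fit(group1, group2[:i] + group2[i + 1:])
        errors += rule(x) != 2
    return errors / (len(group1) + len(group2))

def resubstitution_error_rate(group1, group2, fit=midpoint_rule):
    """APER: the rule is built and evaluated on the same observations."""
    rule = fit(group1, group2)
    errors = sum(rule(x) != 1 for x in group1) + sum(rule(x) != 2 for x in group2)
    return errors / (len(group1) + len(group2))

# Hypothetical toy samples with one borderline observation in each group.
g1 = [0.0, 0.5, 1.0, 2.4]
g2 = [2.6, 4.0, 4.5, 5.0]
aper = resubstitution_error_rate(g1, g2)
aer = holdout_error_rate(g1, g2)
```

In this toy sample the re-substitution estimate is 0 while the holdout estimate is 0.25: each borderline observation pulls the cut point in its own favor when it is used to build the rule, and is misclassified once it is left out.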


Fisher’s linear discrimination function


Another approach stems from R. A. Fisher. His idea was to base the discriminant rule on a
projection $a^\top x$ such that a good separation was achieved. This LDA projection method is
called Fisher’s linear discrimination function. If

$$\mathcal{Y} = \mathcal{X} a$$

denotes a linear combination of observations, then the total sum of squares of $y$, $\sum_{i=1}^{n} (y_i - \bar{y})^2$,
is equal to

$$\mathcal{Y}^\top \mathcal{H} \mathcal{Y} = a^\top \mathcal{X}^\top \mathcal{H} \mathcal{X} a = a^\top \mathcal{T} a \tag{12.11}$$

with the centering matrix $\mathcal{H} = \mathcal{I} - n^{-1} 1_n 1_n^\top$ and $\mathcal{T} = \mathcal{X}^\top \mathcal{H} \mathcal{X}$.


Suppose we have samples $\mathcal{X}_j$, $j = 1, \ldots, J$, from $J$ populations. Fisher’s suggestion was
to find the linear combination $a^\top x$ which maximizes the ratio of the between-group-sum of
squares to the within-group-sum of squares.


The within-group-sum of squares is given by

$$\sum_{j=1}^{J} \mathcal{Y}_j^\top \mathcal{H}_j \mathcal{Y}_j = \sum_{j=1}^{J} a^\top \mathcal{X}_j^\top \mathcal{H}_j \mathcal{X}_j a = a^\top \mathcal{W} a, \tag{12.12}$$

where $\mathcal{Y}_j$ denotes the $j$-th sub-matrix of $\mathcal{Y}$ corresponding to observations of group $j$ and $\mathcal{H}_j$
denotes the $(n_j \times n_j)$ centering matrix. The within-group-sum of squares measures the sum
of variations within each group.


The between-group-sum of squares is

$$\sum_{j=1}^{J} n_j (\bar{y}_j - \bar{y})^2 = \sum_{j=1}^{J} n_j \{a^\top (\bar{x}_j - \bar{x})\}^2 = a^\top \mathcal{B} a, \tag{12.13}$$

where $\bar{y}_j$ and $\bar{x}_j$ denote the means of $\mathcal{Y}_j$ and $\mathcal{X}_j$ and $\bar{y}$ and $\bar{x}$ denote the sample means of
$\mathcal{Y}$ and $\mathcal{X}$. The between-group-sum of squares measures the variation of the means across
groups.


The total sum of squares (12.11) is the sum of the within-group-sum of squares and the
between-group-sum of squares, i.e.,

$$a^\top \mathcal{T} a = a^\top \mathcal{W} a + a^\top \mathcal{B} a.$$
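

The decomposition of the total sum of squares can be verified numerically on the projected scores $y_i = a^\top x_i$. The two groups and the direction $a$ below are hypothetical; any direction would satisfy the identity.

```python
def sum_of_squares(values, center):
    """Sum of squared deviations of the values from a given center."""
    return sum((v - center) ** 2 for v in values)

def mean(values):
    """Arithmetic mean."""
    return sum(values) / len(values)

def project(rows, a):
    """Scores y_i = a^T x_i for each observation x_i."""
    return [sum(ai * xi for ai, xi in zip(a, row)) for row in rows]

# Hypothetical two-group data in two dimensions and an arbitrary direction a.
groups = [
    [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0)],
    [(6.0, 5.0), (7.0, 7.0), (8.0, 6.0), (7.0, 6.0)],
]
a = (0.6, -0.2)

scores = [project(g, a) for g in groups]
all_scores = [y for ys in scores for y in ys]
grand_mean = mean(all_scores)

total = sum_of_squares(all_scores, grand_mean)                          # a^T T a
within = sum(sum_of_squares(ys, mean(ys)) for ys in scores)             # a^T W a
between = sum(len(ys) * (mean(ys) - grand_mean) ** 2 for ys in scores)  # a^T B a
```

Up to rounding error `total` equals `within + between`. Fisher's criterion then compares directions $a$ by the ratio `between / within`.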
Fisher’s idea was to select a projection vector $a$ that maximizes the ratio

$$\frac{a^\top \mathcal{B} a}{a^\top \mathcal{W} a}. \tag{12.14}$$

The solution is found by applying Theorem 2.5.


THEOREM 12.4 The vector $a$ that maximizes (12.14) is the eigenvector of $\mathcal{W}^{-1} \mathcal{B}$ that
corresponds to the largest eigenvalue.


Now a discrimination rule is easy to obtain:
classify $x$ into group $j$ where $a^\top \bar{x}_j$ is closest to $a^\top x$, i.e.,

$$x \to \Pi_j \quad \text{where } j = \arg\min_i |a^\top (x - \bar{x}_i)|.$$
7 


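When $J = 2$ groups, the discriminant rule is easy to compute. Suppose that group 1 has $n_1$
elements and group 2 has $n_2$ elements. In this case

$$\mathcal{B} = \left( \frac{n_1 n_2}{n} \right) d d^\top,$$

where $d = (\bar{x}_1 - \bar{x}_2)$. $\mathcal{W}^{-1} \mathcal{B}$ has only one eigenvalue which equals

$$\operatorname{tr}(\mathcal{W}^{-1} \mathcal{B}) = \left( \frac{n_1 n_2}{n} \right) d^\top \mathcal{W}^{-1} d,$$

and the corresponding eigenvector is $a = \mathcal{W}^{-1} d$. The corresponding discriminant rule is

$$\begin{aligned} x \to \Pi_1 &\quad \text{if } a^\top \{x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\} > 0, \\ x \to \Pi_2 &\quad \text{if } a^\top \{x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\} \le 0. \end{aligned} \tag{12.15}$$

The Fisher LDA is closely related to projection pursuit (Chapter 18) since the statistical
technique is based on a one dimensional index $a^\top x$.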


EXAMPLE 12.6 Consider the bank notes data again. Let us use the subscript “g” for the
genuine and “f” for the counterfeit bank notes, e.g., $\mathcal{X}_g$ denotes the first hundred observations
of $\mathcal{X}$ and $\mathcal{X}_f$ the second hundred. In the context of the bank data set the “between-group-sum
of squares” is defined as

$$100\{(\bar{y}_g - \bar{y})^2 + (\bar{y}_f - \bar{y})^2\} = a^\top \mathcal{B} a \tag{12.16}$$

for some matrix $\mathcal{B}$. Here, $\bar{y}_g$ and $\bar{y}_f$ denote the means for the genuine and counterfeit bank
notes and $\bar{y} = \frac{1}{2}(\bar{y}_g + \bar{y}_f)$. The “within-group-sum of squares” is

$$\sum_{i=1}^{100} \{(y_g)_i - \bar{y}_g\}^2 + \sum_{i=1}^{100} \{(y_f)_i - \bar{y}_f\}^2 = a^\top \mathcal{W} a, \tag{12.17}$$

with $(y_g)_i = a^\top x_i$ and $(y_f)_i = a^\top x_{i+100}$ for $i = 1, \ldots, 100$.


The resulting discriminant rule consists of allocating an observation $x_0$ to the genuine sample
space if

$$a^\top (x_0 - \bar{x}) \ge 0,$$

with $a = \mathcal{W}^{-1}(\bar{x}_g - \bar{x}_f)$ (see Exercise 12.8) and of allocating $x_0$ to the counterfeit sample
space when the opposite is true. In our case

$$a = (0.000, 0.029, -0.029, -0.039, -0.041, 0.054)^\top.$$

One genuine and no counterfeit bank notes are misclassified. Figure 12.2 shows the estimated
densities for $y_g = a^\top \mathcal{X}_g$ and $y_f = a^\top \mathcal{X}_f$. They are separated better than those of the diagonals
in Figure 1.9.


Note that the allocation rule (12.15) is exactly the same as the ML rule for J = 2 groups 
and for normal distributions with the same covariance. For J = 3 groups this rule will be 
different, except for the special case of collinear sample means. 


Figure 12.2. Densities of projections of genuine and counterfeit bank
notes by Fisher’s discrimination rule (Swiss bank notes; curves labeled Forged and Genuine). MVAdisfbank.xpl


Summary


↪ A discriminant rule is a separation of the sample space into sets $R_j$. An
observation $x$ is classified as coming from population $\Pi_j$ if it lies in $R_j$.


↪ The expected cost of misclassification (ECM) for two populations is given
by $ECM = C(2|1)\,p_{21}\,\pi_1 + C(1|2)\,p_{12}\,\pi_2$.


↪ The ML rule is applied if the distributions in the populations are known
up to parameters, e.g., for normal distributions $N_p(\mu_j, \Sigma)$.


↪ The ML rule allocates $x$ to the population that exhibits the smallest Mahalanobis
distance

$$\delta^2(x; \mu_i) = (x - \mu_i)^\top \Sigma^{-1} (x - \mu_i).$$


↪ The probability of misclassification is given by

$$p_{12} = p_{21} = \Phi\left(-\tfrac{1}{2}\delta\right),$$

where $\delta$ is the Mahalanobis distance between $\mu_1$ and $\mu_2$.


↪ Classification for different covariance structures in the two populations
leads to quadratic discrimination rules.


↪ A different approach is Fisher’s linear discrimination rule which finds a
linear combination $a^\top x$ that maximizes the ratio of the “between-group-sum
of squares” and the “within-group-sum of squares”. This rule turns
out to be identical to the ML rule when $J = 2$ for normal populations.


12.3 Boston Housing


One interesting application of discriminant analysis with respect to the Boston housing data
is the classification of the districts according to the house values. The rationale behind this
is that certain observables must determine the value of a district, as in Section 3.7 where
the house value was regressed on the other variables. Two groups are defined according to
the median value of houses $\widetilde{X}_{14}$: in group $\Pi_1$ the value of $\widetilde{X}_{14}$ is greater than or equal to the
median of $\widetilde{X}_{14}$ and in group $\Pi_2$ the value of $\widetilde{X}_{14}$ is less than the median of $\widetilde{X}_{14}$.


The linear discriminant rule, defined on the remaining 12 variables (excluding $\widetilde{X}_4$ and $\widetilde{X}_{14}$), is
applied. After reclassifying the 506 observations, we obtain an apparent error rate of 0.146.
The details are given in Table 12.3. The more appropriate error rate, given by the AER, is
0.160 (see Table 12.4).


Let us now turn to a group definition suggested by the Cluster Analysis in Section 11.4.
Group $\Pi_1$ was defined by higher quality of life and house. We define the linear discriminant
rule using the 13 variables from $\widetilde{\mathcal{X}}$ excluding $\widetilde{X}_4$. Then we reclassify the 506 observations
and we obtain an APER of 0.0395. Details are summarized in Table 12.5. The AER turns
out to be 0.0415 (see Table 12.6).


                       True
                       $\Pi_1$   $\Pi_2$
Predicted   $\Pi_1$    216       40
            $\Pi_2$    34        216


Table 12.3. APER for price of Boston houses. MVAdiscbh.xpl


                       True
                       $\Pi_1$   $\Pi_2$
Predicted   $\Pi_1$    211       42
            $\Pi_2$    39        214


Table 12.4. AER for price of Boston houses. MVAaerbh.xpl


                       True
                       $\Pi_1$   $\Pi_2$
Predicted   $\Pi_1$    244       13
            $\Pi_2$    7         242


Table 12.5. APER for clusters of Boston houses.


                       True
                       $\Pi_1$   $\Pi_2$
Predicted   $\Pi_1$    244       14
            $\Pi_2$    7         241


Table 12.6. AER for clusters of Boston houses. MVAaerbh.xpl


Figure 12.3 displays the values of the linear discriminant scores (see Theorem 12.2) for all
of the 506 observations, colored by groups. One can clearly see the APER is derived from
the 7 observations from group $\Pi_1$ with a negative score and the 13 observations from group
$\Pi_2$ with positive score.


Figure 12.3. Discrimination scores for the two clusters created from the
Boston housing data. MVAdiscbh.xpl


12.4 Exercises 
EXERCISE 12.1 Prove Theorem 12.2 (a) and 12.2 (b). 


EXERCISE 12.2 Apply the rule from Theorem 12.2 (b) for p = 1 and compare the result 
with that of Example 12.3. 


EXERCISE 12.3 Calculate the ML discrimination rule based on observations of a one- 
dimensional variable with an exponential distribution. 


EXERCISE 12.4 Calculate the ML discrimination rule based on observations of a two- 
dimensional random variable, where the first component has an exponential distribution and 
the other has an alternative distribution. What is the difference between the discrimination 
rule obtained in this exercise and the Bayes discrimination rule? 


EXERCISE 12.5 Apply the Bayes rule to the car data (Table B.3) in order to discriminate
between Japanese, European and U.S. cars, i.e., $J = 3$. Consider only the “miles per gallon”
variable and take the relative frequencies as prior probabilities.


EXERCISE 12.6 Compute Fisher’s linear discrimination function for the 20 bank notes 
from Example 11.6. Apply it to the entire bank data set. How many observations are mis- 
classified? 


EXERCISE 12.7 Use Fisher’s linear discrimination function on the WAIS data set (Table B.12)
and evaluate the probabilities of misclassification by re-substitution.


EXERCISE 12.8 Show that in Example 12.6


(a) $\mathcal{W} = 100(\mathcal{S}_g + \mathcal{S}_f)$, where $\mathcal{S}_g$ and $\mathcal{S}_f$ denote the empirical covariances (3.6) and (3.5)
w.r.t. the genuine and counterfeit bank notes,


(b) $\mathcal{B} = 100\{(\bar{x}_g - \bar{x})(\bar{x}_g - \bar{x})^\top + (\bar{x}_f - \bar{x})(\bar{x}_f - \bar{x})^\top\}$, where $\bar{x} = \frac{1}{2}(\bar{x}_g + \bar{x}_f)$,

(c) $a = \mathcal{W}^{-1}(\bar{x}_g - \bar{x}_f)$.


EXERCISE 12.9 Recalculate Example 12.3 with the prior probability ™ = ; and C'(2|1) = 
2C(1|2). 


EXERCISE 12.10 Explain the effect of changing $\pi_1$ or $C(1|2)$ on the relative location of the
region $R_j$, $j = 1, 2$.


EXERCISE 12.11 Prove that Fisher’s linear discrimination function is identical to the ML 
rule when the covariance matrices are identical (J = 2). 


EXERCISE 12.12 Suppose that $x \in \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$ and


$\Pi_1 : X \sim Bi(10, 0.2)$ with the prior probability $\pi_1 = 0.5$;
$\Pi_2 : X \sim Bi(10, 0.3)$ with the prior probability $\pi_2 = 0.3$;
$\Pi_3 : X \sim Bi(10, 0.5)$ with the prior probability $\pi_3 = 0.2$.


Determine the sets $R_1$, $R_2$ and $R_3$. (Use the Bayes discriminant rule.)


13 Correspondence Analysis 


Correspondence analysis provides tools for analyzing the associations between rows and 
columns of contingency tables. A contingency table is a two-entry frequency table where 
the joint frequencies of two qualitative variables are reported. For instance a (2 x 2) table 
could be formed by observing from a sample of n individuals two qualitative variables: the 
individual’s sex and whether the individual smokes. The table reports the observed joint 
frequencies. In general (n x p) tables may be considered. 


The main idea of correspondence analysis is to develop simple indices that will show the relations 
between the row and the column categories. These indices will tell us simultaneously 
which column categories have more weight in a row category and vice-versa. Correspondence 
analysis is also related to the issue of reducing the dimension of the table, similar to principal 
component analysis in Chapter 9, and to the issue of decomposing the table into its factors 
as discussed in Chapter 8. The idea is to extract the indices in decreasing order of impor- 
tance so that the main information of the table can be summarized in spaces with smaller 
dimensions. For instance, if only two factors (indices) are used, the results can be shown in 
two-dimensional graphs, showing the relationship between the rows and the columns of the 
table. 


Section 13.1 defines the basic notation and motivates the approach and Section 13.2 gives the 
basic theory. The indices will be used to describe the $\chi^2$ statistic measuring the associations 
in the table. Several examples in Section 13.3 show how to provide and interpret, in practice, 
the two-dimensional graphs displaying the relationship between the rows and the columns 
of a contingency table. 


13.1 Motivation 


The aim of correspondence analysis is to develop simple indices that show relations between 
the rows and columns of a contingency table. Contingency tables are very useful to describe 
the association between two variables in very general situations. The two variables can be 
qualitative (nominal), in which case they are also referred to as categorical variables. Each 
row and each column in the table represents one category of the corresponding variable. 
The entry $x_{ij}$ in the table $\mathcal{X}$ (with dimension $(n \times p)$) is the number of observations in a 
sample which simultaneously fall in the i-th row category and the j-th column category, for 
$i = 1, \dots, n$ and $j = 1, \dots, p$. Sometimes a “category” of a nominal variable is also called a 
“modality” of the variable. 


The variables of interest can also be discrete quantitative variables, such as the number of 
family members or the number of accidents an insurance company had to cover during one 
year, etc. Here, each possible value that the variable can have defines a row or a column 
category. Continuous variables may be taken into account by defining the categories in terms 
of intervals or classes of values which the variable can take on. Thus contingency tables can 
be used in many situations, implying that correspondence analysis is a very useful tool in 
many applications. 


The graphical relationships between the rows and the columns of the table $\mathcal{X}$ that result 
from correspondence analysis are based on the idea of representing all the row and column 
categories and interpreting the relative positions of the points in terms of the weights corre- 
sponding to the column and the row. This is achieved by deriving a system of simple indices 
providing the coordinates of each row and each column. These row and column coordinates 
are simultaneously represented in the same graph. It is then easy to see which column 
categories are more important in the row categories of the table (and the other way around). 


As already alluded to, the construction of the indices is based on an idea similar to 
that of PCA. Using PCA the total variance was partitioned into independent contributions 
stemming from the principal components. Correspondence analysis, on the other hand, decomposes 
a measure of association, typically the total $\chi^2$ value used in testing independence, 
rather than decomposing the total variance. 


EXAMPLE 13.1 The French “baccalauréat” frequencies have been classified into regions 
and different baccalauréat categories, see Appendix, Table B.8. Altogether n = 202100 bac- 
calauréats were observed. The joint frequency of the region Ile-de-France and the modality 
Philosophy, for example, is 9724. That is, 9724 baccalauréats were in Ile-de-France and the 
category Philosophy. 


The question is whether certain regions prefer certain baccalauréat types. If we consider, for 
instance, the region Lorraine, we have the following percentages: 


A B C D E F G H 


20.5 7.6 15.3 19.6 3.4 14.5 18.9 0.2 


The total percentages of the different modalities of the variable baccalauréat are as follows: 


A B C D E F G H 


22.6 10.7 16.2 22.8 2.6 9.7 15.2 0.2 


One might argue that the region Lorraine seems to prefer the modalities E, F, G and dislike 
the specializations A, B, C, D relative to the overall frequency of baccalauréat type. 
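

The comparison of the Lorraine profile with the overall profile can be checked in a few lines. The following is only a minimal sketch: the percentages are those quoted in Example 13.1, the overall share of modality F is taken as 9.7 (the value that makes the overall percentages sum to 100), and the helper name `representation` is a hypothetical choice made for this illustration.

```python
# Percentages quoted in Example 13.1.
# Assumption: the overall share of F is 9.7, so that the row sums to 100.
modalities = ["A", "B", "C", "D", "E", "F", "G", "H"]
lorraine = [20.5, 7.6, 15.3, 19.6, 3.4, 14.5, 18.9, 0.2]
overall = [22.6, 10.7, 16.2, 22.8, 2.6, 9.7, 15.2, 0.2]


def representation(region, total, tol=1e-9):
    """Split the modalities into over- and under-represented ones.

    A modality is over-represented when its share in the region exceeds
    its overall share by more than `tol`, and under-represented when it
    falls short by more than `tol`.
    """
    over = [m for m, r, t in zip(modalities, region, total) if r - t > tol]
    under = [m for m, r, t in zip(modalities, region, total) if t - r > tol]
    return over, under


over, under = representation(lorraine, overall)
print("over-represented in Lorraine: ", over)
print("under-represented in Lorraine:", under)
```

Modality H has the same share in both profiles and therefore appears in neither list. Correspondence analysis condenses such modality-by-modality comparisons into a small number of indices.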


In correspondence analysis we try to develop an index for the regions so that this over- or 
underrepresentation can be measured in just one single number. Simultaneously we try to 
weight the regions so that we can see in which region certain baccalauréat types are preferred. 


EXAMPLE 13.2 Consider n types of companies and p locations of these companies. Is there 
a certain type of company that prefers a certain location? Or is there a location index that 
corresponds to a certain type of company? 


Assume that n = 3, p = 3, and that the frequencies are as follows: 


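$$
\mathcal{X} = \begin{pmatrix} 4 & 0 & 2 \\ 0 & 1 & 1 \\ 1 & 1 & 4 \end{pmatrix}
\begin{matrix} \leftarrow \text{Finance} \\ \leftarrow \text{Energy} \\ \leftarrow \text{HiTech} \end{matrix}
$$

The columns correspond, from left to right, to Frankfurt, Berlin and Munich. 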


The frequencies imply that four type 3 companies (HiTech) are in location 3 (Munich), and 
so on. Suppose there is a (company) weight vector $r = (r_1, \dots, r_n)^\top$ such that a location 
index $s_j$ could be defined as 

$$ s_j = c \sum_{i=1}^{n} r_i \frac{x_{ij}}{x_{\bullet j}}, \tag{13.1} $$

where $x_{\bullet j} = \sum_{i=1}^{n} x_{ij}$ is the number of companies in location j and c is a constant. $s_1$, 
for example, would give the average weighted frequency (by r) of companies in location 1 
(Frankfurt). 


Given a location weight vector $s^* = (s_1^*, \dots, s_p^*)^\top$, we can define a company index in the 
same way as 

$$ r_i^* = c^* \sum_{j=1}^{p} s_j^* \frac{x_{ij}}{x_{i\bullet}}, \tag{13.2} $$

where $c^*$ is a constant and $x_{i\bullet} = \sum_{j=1}^{p} x_{ij}$ is the sum of the i-th row of $\mathcal{X}$, i.e., the number 
of type i companies. Thus $r_2^*$, for example, would give the average weighted frequency (by $s^*$) 
of energy companies. 


If (13.1) and (13.2) can be solved simultaneously for a “row weight” vector $r = (r_1, \dots, r_n)^\top$ 
and a “column weight” vector $s = (s_1, \dots, s_p)^\top$, we may represent each row category by 
$r_i$, $i = 1, \dots, n$ and each column category by $s_j$, $j = 1, \dots, p$ in a one-dimensional graph. If 
in this graph $r_i$ and $s_j$ are in close proximity (far from the origin), this would indicate that 
the i-th row category has an important conditional frequency $x_{ij}/x_{\bullet j}$ in (13.1) and that the 
j-th column category has an important conditional frequency $x_{ij}/x_{i\bullet}$ in (13.2). This would 
indicate a positive association between the i-th row and the j-th column. A similar line of 
argument could be used if $r_i$ was very far away from $s_j$ (and far from the origin). This would 
indicate a small conditional frequency contribution, or a negative association between the 
i-th row and the j-th column. 


Summary 


↪ The aim of correspondence analysis is to develop simple indices that show 
relations among qualitative variables in a contingency table. 


↪ The joint representation of the indices reveals relations among the variables. 


13.2 $\chi^2$ Decomposition 


An alternative way of measuring the association between the row and column categories is 
a decomposition of the value of the $\chi^2$-test statistic. The well known $\chi^2$-test for independence 
in a two-dimensional contingency table consists of two steps. First the expected value 
of each cell of the table is estimated under the hypothesis of independence. Second, the 
corresponding observed values are compared to the expected values using the statistic 

$$ t = \sum_{i=1}^{n} \sum_{j=1}^{p} \frac{(x_{ij} - E_{ij})^2}{E_{ij}}, \tag{13.3} $$

where $x_{ij}$ is the observed frequency in cell $(i, j)$ and $E_{ij}$ is the corresponding estimated 
expected value under the assumption of independence, i.e., 

$$ E_{ij} = \frac{x_{i\bullet}\, x_{\bullet j}}{x_{\bullet\bullet}}. \tag{13.4} $$

Here $x_{\bullet\bullet} = \sum_{i=1}^{n} x_{i\bullet}$. Under the hypothesis of independence, t has a $\chi^2_{(n-1)(p-1)}$ distribution. 
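In the industrial location example introduced above, $t = 6.26$ with $(3-1)(3-1) = 4$ degrees of 
freedom. This falls short of the 5% critical value of the $\chi^2_4$ distribution (9.49), so independence 
is not rejected at that level. It is nevertheless instructive to investigate where the departures 
from independence come from. 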
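

The statistic (13.3) for the industrial location table can be recomputed directly. The sketch below assumes NumPy is available and uses the (3 x 3) table of Example 13.2. The p-value line relies on the closed form of the $\chi^2$ survival function that holds for 4 degrees of freedom only, so it is not a general-purpose routine.

```python
import math

import numpy as np

# Industrial location table (rows: Finance, Energy, HiTech;
# columns: Frankfurt, Berlin, Munich).
X = np.array([[4, 0, 2],
              [0, 1, 1],
              [1, 1, 4]], dtype=float)

row_tot = X.sum(axis=1, keepdims=True)      # x_{i.} as a column vector
col_tot = X.sum(axis=0, keepdims=True)      # x_{.j} as a row vector
E = row_tot @ col_tot / X.sum()             # expected counts, (13.4)

t = float(((X - E) ** 2 / E).sum())         # chi-square statistic, (13.3)
df = (X.shape[0] - 1) * (X.shape[1] - 1)    # (n - 1)(p - 1)

# For df = 4 the survival function has the closed form exp(-t/2) * (1 + t/2).
p_value = math.exp(-t / 2) * (1 + t / 2)

print(f"t = {t:.2f}, df = {df}, p-value = {p_value:.3f}")
```

The computed statistic is about 6.27, which agrees with the value 6.26 quoted in the text up to rounding, and the p-value comes out at roughly 0.18.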


The method of $\chi^2$ decomposition consists of finding the SVD of the matrix $C$ $(n \times p)$ with 
elements 

$$ c_{ij} = (x_{ij} - E_{ij}) / E_{ij}^{1/2}. \tag{13.5} $$

The elements $c_{ij}$ may be viewed as measuring the (weighted) departure between the observed 
$x_{ij}$ and the theoretical values $E_{ij}$ under independence. This leads to the factorial tools of 
Chapter 8 which describe the rows and the columns of $C$. 


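For simplification define the matrices $A$ $(n \times n)$ and $B$ $(p \times p)$ as 

$$ A = \operatorname{diag}(x_{i\bullet}) \quad \text{and} \quad B = \operatorname{diag}(x_{\bullet j}). \tag{13.6} $$

These matrices provide the marginal row frequencies $a$ $(n \times 1)$ and the marginal column 
frequencies $b$ $(p \times 1)$: 

$$ a = A 1_n \quad \text{and} \quad b = B 1_p. \tag{13.7} $$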

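It is easy to verify that 

$$ C \sqrt{b} = 0 \quad \text{and} \quad C^\top \sqrt{a} = 0, \tag{13.8} $$

where the square root of the vector is taken element by element and $R = \operatorname{rank}(C) \le \min\{(n-1), (p-1)\}$. 
From (8.14) of Chapter 8, the SVD of $C$ yields 

$$ C = \Gamma \Lambda \Delta^\top, \tag{13.9} $$

where $\Gamma$ contains the eigenvectors of $CC^\top$, $\Delta$ the eigenvectors of $C^\top C$ and 
$\Lambda = \operatorname{diag}(\lambda_1^{1/2}, \dots, \lambda_R^{1/2})$ with $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_R$ (the eigenvalues of $CC^\top$). Equation (13.9) 
implies that 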

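$$ c_{ij} = \sum_{k=1}^{R} \lambda_k^{1/2} \gamma_{ik} \delta_{jk}. \tag{13.10} $$

Note that (13.3) can be rewritten as 

$$ \operatorname{tr}(CC^\top) = \sum_{k=1}^{R} \lambda_k = \sum_{i=1}^{n} \sum_{j=1}^{p} c_{ij}^2 = t. \tag{13.11} $$

This relation shows that the SVD of $C$ decomposes the total $\chi^2$ value rather than, as in 
Chapter 8, the total variance. 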
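

The decomposition can be verified numerically on the industrial location table. This is a sketch under the assumption that NumPy's `numpy.linalg.svd` is used for the SVD. The variable names (`Gamma`, `Delta_t`, `lam`) are illustrative and simply mirror the symbols in (13.9).

```python
import numpy as np

# Industrial location table of Example 13.2.
X = np.array([[4, 0, 2],
              [0, 1, 1],
              [1, 1, 4]], dtype=float)

a = X.sum(axis=1)                  # marginal row frequencies x_{i.}
b = X.sum(axis=0)                  # marginal column frequencies x_{.j}
E = np.outer(a, b) / X.sum()       # expected counts under independence, (13.4)
C = (X - E) / np.sqrt(E)           # weighted departures, (13.5)

# Thin SVD: C = Gamma diag(sqrt(lambda)) Delta^T, as in (13.9).
Gamma, sing, Delta_t = np.linalg.svd(C, full_matrices=False)
lam = sing ** 2                    # eigenvalues of C C^T, in decreasing order

t = (C ** 2).sum()                 # chi-square statistic, (13.3)

print("eigenvalues:       ", np.round(lam, 4))
print("sum of eigenvalues:", round(float(lam.sum()), 4))
print("chi-square t:      ", round(float(t), 4))
print("C sqrt(b):         ", np.round(C @ np.sqrt(b), 12))
```

The eigenvalues add up to the $\chi^2$ statistic, and the smallest one is numerically zero, in line with $R \le \min\{n-1, p-1\} = 2$ for a (3 x 3) table.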


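The duality relations between the row and the column space (8.11) are now for $k = 1, \dots, R$ 
given by 

$$ \delta_k = \frac{1}{\sqrt{\lambda_k}} C^\top \gamma_k, \qquad \gamma_k = \frac{1}{\sqrt{\lambda_k}} C \delta_k. \tag{13.12} $$

The projections of the rows and the columns of $C$ are given by 

$$ C \delta_k = \sqrt{\lambda_k}\, \gamma_k, \qquad C^\top \gamma_k = \sqrt{\lambda_k}\, \delta_k. \tag{13.13} $$

Note that the eigenvectors satisfy 

$$ \delta_k^\top \sqrt{b} = 0, \qquad \gamma_k^\top \sqrt{a} = 0. \tag{13.14} $$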


From (13.10) we see that the eigenvectors $\delta_k$ and $\gamma_k$ are the objects of interest when analyzing 
the correspondence between the rows and the columns. Suppose that the first eigenvalue in 
(13.10) is dominant so that 

$$ c_{ij} \approx \lambda_1^{1/2} \gamma_{i1} \delta_{j1}. \tag{13.15} $$

In this case when the coordinates $\gamma_{i1}$ and $\delta_{j1}$ are both large (with the same sign) relative to 
the other coordinates, then $c_{ij}$ will be large as well, indicating a positive association between 
the i-th row and the j-th column category of the contingency table. If $\gamma_{i1}$ and $\delta_{j1}$ were both 
large with opposite signs, then there would be a negative association between the i-th row 
and j-th column. 


In many applications, the first two eigenvalues, $\lambda_1$ and $\lambda_2$, dominate and the percentage of 
the total $\chi^2$ explained by the eigenvectors $\gamma_1$ and $\gamma_2$ and $\delta_1$ and $\delta_2$ is large. In this case (13.13) 
and $(\gamma_1, \gamma_2)$ can be used to obtain a graphical display of the n rows of the table ($(\delta_1, \delta_2)$ play 
a similar role for the p columns of the table). The proximity between 
row and column points is interpreted as above with respect to (13.10). 


In correspondence analysis, we use the projections of weighted rows of $C$ and the projections 
of weighted columns of $C$ for graphical displays. Let $r_k$ $(n \times 1)$ be the projections of $A^{-1/2} C$ 
on $\delta_k$ and $s_k$ $(p \times 1)$ be the projections of $B^{-1/2} C^\top$ on $\gamma_k$ $(k = 1, \dots, R)$: 

$$ r_k = A^{-1/2} C \delta_k = \sqrt{\lambda_k}\, A^{-1/2} \gamma_k, \qquad s_k = B^{-1/2} C^\top \gamma_k = \sqrt{\lambda_k}\, B^{-1/2} \delta_k. \tag{13.16} $$

These vectors have the property that 

$$ r_k^\top a = 0, \qquad s_k^\top b = 0. \tag{13.17} $$

The obtained projections on each axis $k = 1, \dots, R$ are centered at zero with the natural 
weights given by $a$ (the marginal frequencies of the rows of $\mathcal{X}$) for the row coordinates $r_k$ and 
by $b$ (the marginal frequencies of the columns of $\mathcal{X}$) for the column coordinates $s_k$ (compare 
this to expression (13.14)). As a result, the origin is the center of gravity for all of the 
representations. We also know from (13.16) and the SVD of $C$ that 

$$ r_k^\top A r_k = \lambda_k, \qquad s_k^\top B s_k = \lambda_k. \tag{13.18} $$


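From the duality relation between $\delta_k$ and $\gamma_k$ (see (13.12)) we obtain 

$$ r_k = \frac{1}{\sqrt{\lambda_k}} A^{-1/2} C B^{1/2} s_k, \qquad s_k = \frac{1}{\sqrt{\lambda_k}} B^{-1/2} C^\top A^{1/2} r_k, \tag{13.19} $$

which can be simplified to 

$$ r_k = \sqrt{\frac{x_{\bullet\bullet}}{\lambda_k}}\, A^{-1} \mathcal{X} s_k, \qquad s_k = \sqrt{\frac{x_{\bullet\bullet}}{\lambda_k}}\, B^{-1} \mathcal{X}^\top r_k. \tag{13.20} $$

These vectors satisfy the relations (13.1) and (13.2) for each $k = 1, \dots, R$ simultaneously. 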
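

The properties (13.17), (13.18) and (13.20) of the factors can be checked numerically. The following sketch again uses the industrial location table and NumPy. The tolerance `1e-10` used to determine the numerical rank and the array layout (column k of `r` and `s` holds the k-th factor) are assumptions made for this illustration.

```python
import numpy as np

# Industrial location table of Example 13.2.
X = np.array([[4, 0, 2],
              [0, 1, 1],
              [1, 1, 4]], dtype=float)

a, b, x_tot = X.sum(axis=1), X.sum(axis=0), X.sum()
E = np.outer(a, b) / x_tot
C = (X - E) / np.sqrt(E)                          # (13.5)

Gamma, sing, Delta_t = np.linalg.svd(C, full_matrices=False)
R = int((sing > 1e-10).sum())                     # numerical rank of C
lam = sing[:R] ** 2
Gamma, Delta = Gamma[:, :R], Delta_t.T[:, :R]

# Factors (13.16); column k of r (resp. s) holds r_k (resp. s_k).
r = np.sqrt(lam) * Gamma / np.sqrt(a)[:, None]
s = np.sqrt(lam) * Delta / np.sqrt(b)[:, None]

# Transition formulas (13.20): each set of coordinates is a rescaled
# weighted average of the other one, which is the form of (13.1)-(13.2).
r_from_s = np.sqrt(x_tot / lam) * (X @ s) / a[:, None]
s_from_r = np.sqrt(x_tot / lam) * (X.T @ r) / b[:, None]

print("centered (13.17):  ", np.allclose(a @ r, 0), np.allclose(b @ s, 0))
print("norms (13.18):     ", np.allclose(r.T @ np.diag(a) @ r, np.diag(lam)))
print("transition (13.20):", np.allclose(r, r_from_s), np.allclose(s, s_from_r))
```

The constant $c$ in (13.1) then corresponds to $\sqrt{x_{\bullet\bullet}/\lambda_k}$, and the same constant plays the role of $c^*$ in (13.2).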


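As in Chapter 8, the vectors $r_k$ and $s_k$ are referred to as factors (row factor and column 
factor respectively). They have the following means and variances: 

$$ \bar{r}_k = \frac{1}{x_{\bullet\bullet}} r_k^\top a = 0, \qquad \bar{s}_k = \frac{1}{x_{\bullet\bullet}} s_k^\top b = 0, \tag{13.21} $$

and 

$$ \operatorname{Var}(r_k) = \frac{1}{x_{\bullet\bullet}} \sum_{i=1}^{n} x_{i\bullet}\, r_{ki}^2 = \frac{r_k^\top A r_k}{x_{\bullet\bullet}} = \frac{\lambda_k}{x_{\bullet\bullet}}, \qquad \operatorname{Var}(s_k) = \frac{1}{x_{\bullet\bullet}} \sum_{j=1}^{p} x_{\bullet j}\, s_{kj}^2 = \frac{s_k^\top B s_k}{x_{\bullet\bullet}} = \frac{\lambda_k}{x_{\bullet\bullet}}. \tag{13.22} $$

Hence, $\lambda_k / \sum_{j=1}^{R} \lambda_j$, which is the part of the k-th factor in the decomposition of the $\chi^2$ 
statistic t, may also be interpreted as the proportion of the variance explained by the factor 
k. The proportions 

$$ C_a(i, r_k) = \frac{x_{i\bullet}\, r_{ki}^2}{\lambda_k}, \quad \text{for } i = 1, \dots, n, \; k = 1, \dots, R \tag{13.23} $$

are called the absolute contributions of row i to the variance of the factor $r_k$. They show 
which row categories are most important in the dispersion of the k-th row factors. Similarly, 
the proportions 

$$ C_a(j, s_k) = \frac{x_{\bullet j}\, s_{kj}^2}{\lambda_k}, \quad \text{for } j = 1, \dots, p, \; k = 1, \dots, R \tag{13.24} $$

are called the absolute contributions of column j to the variance of the column factor $s_k$. 
These absolute contributions may help to interpret the graph obtained by correspondence 
analysis. 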
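

The absolute contributions (13.23) and (13.24) follow directly from the factors. The sketch below wraps the computation in a helper, `absolute_contributions`, which is a hypothetical name introduced for this illustration. It assumes NumPy and a table with strictly positive row and column totals.

```python
import numpy as np


def absolute_contributions(X, tol=1e-10):
    """Return eigenvalues and absolute contributions of rows and columns.

    Column k of the returned arrays refers to the k-th factor.
    """
    X = np.asarray(X, dtype=float)
    a, b = X.sum(axis=1), X.sum(axis=0)
    E = np.outer(a, b) / X.sum()
    C = (X - E) / np.sqrt(E)                              # (13.5)
    Gamma, sing, Delta_t = np.linalg.svd(C, full_matrices=False)
    R = int((sing > tol).sum())                           # numerical rank
    lam = sing[:R] ** 2
    r = np.sqrt(lam) * Gamma[:, :R] / np.sqrt(a)[:, None]      # (13.16)
    s = np.sqrt(lam) * Delta_t.T[:, :R] / np.sqrt(b)[:, None]  # (13.16)
    contrib_rows = a[:, None] * r ** 2 / lam              # (13.23)
    contrib_cols = b[:, None] * s ** 2 / lam              # (13.24)
    return lam, contrib_rows, contrib_cols


# Industrial location table of Example 13.2.
X = [[4, 0, 2],
     [0, 1, 1],
     [1, 1, 4]]
lam, ca_rows, ca_cols = absolute_contributions(X)

print("explained proportions:", np.round(lam / lam.sum(), 4))
print("row contributions:\n", np.round(ca_rows, 4))
print("column sums:", ca_rows.sum(axis=0), ca_cols.sum(axis=0))
```

Because of (13.18), the contributions of each factor sum to one over the rows (and over the columns), which is why the tables of absolute contributions in the examples have column totals equal to 1.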


13.3 Correspondence Analysis in Practice 


The graphical representations on the axes $k = 1, 2, \dots, R$ of the n rows and of the p columns 
of $\mathcal{X}$ are provided by the elements of $r_k$ and $s_k$. Two-dimensional displays are 
often satisfactory if the cumulated percentage of variance explained by the first two factors, 
$\Psi_2 = \sum_{k=1}^{2} \lambda_k \big/ \sum_{k=1}^{R} \lambda_k$, is sufficiently large. 


The interpretation of the graphs may be summarized as follows: 


- The proximity of two rows (two columns) indicates a similar profile in these two rows 
(two columns), where “profile” refers to the conditional frequency distribution of 
a row (column); those two rows (columns) are almost proportional. The opposite 
interpretation applies when the two rows (two columns) are far apart. 


- The proximity of a particular row to a particular column indicates that this row (column) 
has a particularly important weight in this column (row). In contrast to this, 
a row that is quite distant from a particular column indicates that there are almost 
no observations in this column for this row (and vice versa). Of course, as mentioned 
above, these conclusions are particularly true when the points are far away from 0. 


- The origin is the average of the factors $r_k$ and $s_k$. Hence, a particular point (row or 
column) projected close to the origin indicates an average profile. 


- The absolute contributions are used to evaluate the weight of each row (column) in 
the variances of the factors. 


- All the interpretations outlined above must be carried out in view of the quality of the 
graphical representation which is evaluated, as in PCA, using the cumulated percentage 
of variance. 


REMARK 13.1 Note that correspondence analysis can also be applied to more general 
$(n \times p)$ tables $\mathcal{X}$ which in a “strict sense” are not contingency tables. 


As long as statistical (or natural) meaning can be given to sums over rows and columns, 
Remark 13.1 holds. This implies, in particular, that all of the variables are measured in the 
same units. In that case, x.. constitutes the total frequency of the observed phenomenon, and 
is shared between individuals (n rows) and between variables (p columns). Representations 
of the rows and columns of $\mathcal{X}$, $r_k$ and $s_k$, have the basic property (13.19) and show which 
variables have important weights for each individual and vice versa. This type of analysis is 
used as an alternative to PCA. PCA is mainly concerned with covariances and correlations, 
whereas correspondence analysis analyzes a more general kind of association. (See Exercises 
13.3 and 13.11.) 


EXAMPLE 13.3 A survey of Belgian citizens who regularly read a newspaper was conducted 
in the 1980’s. They were asked where they lived. The possible answers were 10 regions: 7 
provinces (Antwerp, Western Flanders, Eastern Flanders, Hainaut, Liège, Limbourg, Luxembourg) 
and 3 regions around Brussels (Flemish-Brabant, Wallon-Brabant and the city of 
Brussels). They were also asked what kind of newspapers they read on a regular basis. There 
were 15 possible answers split up into 3 classes: Flemish newspapers (label begins with the 
letter v), French newspapers (label begins with f) and both languages together (label begins 
with b). The data set is given in Table B.9. The eigenvalues of the factorial correspondence 
analysis are given in Table 13.2. 


Two-dimensional representations will be quite satisfactory since the first two eigenvalues 
account for 81% of the variance. Figure 13.1 shows the projections of the rows (the 15 
newspapers) and of the columns (the 10 regions). 


$\lambda_j$      percentage of variance    cumulated percentage 

183.40    0.653                     0.653 
 43.75    0.156                     0.809 
 25.21    0.090                     0.898 
 11.74    0.042                     0.940 
  8.04    0.029                     0.969 
  4.68    0.017                     0.985 
  2.13    0.008                     0.993 
  1.20    0.004                     0.997 
  0.82    0.003                     1.000 
  0.00    0.000                     1.000 


Table 13.2. Eigenvalues and percentages of the variance (Example 13.3). 


Figure 13.1. Projection of rows (the 15 newspapers) and columns (the 10 
regions), journal data. MVAcorrjourn.xpl 


         $C_a(i, r_1)$   $C_a(i, r_2)$   $C_a(i, r_3)$ 

$v_a$      0.0563        0.0008        0.0036 
$v_b$      0.1555        0.5567        0.0067 
$v_c$      0.0244        0.1179        0.0266 
$v_d$      0.1352        0.0952        0.0164 
$v_e$      0.0253        0.1193        0.0013 
$f_f$      0.0314        0.0183        0.0597 
$f_g$      0.0585        0.0162        0.0122 
$f_h$      0.1086        0.0024        0.0656 
$f_i$      0.1001        0.0024        0.6376 
$b_j$      0.0029        0.0055        0.0187 
$b_k$      0.0236        0.0278        0.0237 
$b_l$      0.0006        0.0090        0.0064 
$v_m$      0.1000        0.0038        0.0047 
$f_n$      0.0966        0.0059        0.0269 
$f_o$      0.0810        0.0188        0.0899 
Total    1.0000        1.0000        1.0000 


Table 13.3. Absolute contributions of row factors $r_k$. 


As expected, there is a high association between the regions and the type of newspapers which 
is read. In particular, $v_b$ (Gazet van Antwerp) is almost exclusively read in the province of 
Antwerp (this is an extreme point in the graph). The points on the left all belong to Flanders, 
whereas those on the right all belong to Wallonia. Notice that the Wallon-Brabant and the 
Flemish-Brabant are not far from Brussels. Brussels is close to the center (average) and also 
close to the bilingual newspapers. It is shifted a little to the right of the origin due to the 
majority of French speaking people in the area. 


The absolute contributions of the first 3 factors are listed in Tables 13.3 and 13.4. The row 
factors $r_k$ are in Table 13.3 and the column factors $s_k$ are in Table 13.4. 


They show, for instance, the important role of Antwerp and the newspaper $v_b$ in determining 
the variance of both factors. Clearly, the first axis expresses linguistic differences between 
the 3 parts of Belgium. The second axis shows a larger dispersion among the Flemish 
regions than among the French speaking regions. Note also that the 3-rd axis shows an important 
role of the category “$f_i$” (other French newspapers) with the Wallon-Brabant “brw” and the 
Hainaut “hai” showing the most important contributions. The coordinate of “$f_i$” on this axis 
is negative (not shown here), as are the coordinates of “brw” and “hai”. Apparently, these 
two regions also seem to feature a greater proportion of readers of more local newspapers. 


EXAMPLE 13.4 Applying correspondence analysis to the French baccalauréat data (Ta- 
ble B.8) leads to Figure 13.2. Excluding Corsica we obtain Figure 13.3. The different 


         $C_a(j, s_1)$   $C_a(j, s_2)$   $C_a(j, s_3)$ 

brw      0.0887        0.0210        0.2860 
bxl      0.1259        0.0010        0.0960 
anv      0.2999        0.4349        0.0029 
brf      0.0064        0.2370        0.0090 
foc      0.0729        0.1409        0.0033 
for      0.0998        0.0023        0.0079 
hai      0.1046        0.0012        0.3141 
lig      0.1168        0.0355        0.1025 
lim      0.0562        0.1162        0.0027 
lux      0.0288        0.0101        0.1761 
Total    1.0000        1.0000        1.0000 


Table 13.4. Absolute contributions of column factors $s_k$. 


eigenvalues $\lambda$    percentage of variances    cumulated percentage 

2436.2           0.5605                     0.561 
1052.4           0.2421                     0.803 
 341.8           0.0786                     0.881 
 229.5           0.0528                     0.934 
 152.2           0.0350                     0.969 
 109.1           0.0251                     0.994 
  25.0           0.0058                     1.000 
   0.0           0.0000                     1.000 


Table 13.5. Eigenvalues and percentages of explained variance (including 
Corsica). 


modalities are labeled A, ..., H and the regions are labeled ILDF, ..., CORS. The results of the 
correspondence analysis are given in Table 13.5 and Figure 13.2. 


The first two factors explain 80 % of the total variance. It is clear from Figure 13.2 that 
Corsica (in the upper left) is an outlier. The analysis is therefore redone without Corsica 
and the results are given in Table 13.6 and Figure 13.3. Since Corsica has such a small 
weight in the analysis, the results have not changed much. 


The projections on the first three axes, along with their absolute contribution to the variance 
of the axis, are summarized in Table 13.7 for the regions and in Table 13.8 for baccalauréats. 


The interpretation of the results may be summarized as follows. Table 13.8 shows that 
the baccalauréats B on one side and F on the other side are most strongly responsible for 
the variation on the first axis. The second axis mostly characterizes an opposition between 


Figure 13.2. Correspondence analysis including Corsica, baccalauréat data. 
MVAcorrbac.xpl 


eigenvalues $\lambda$    percentage of variances    cumulated percentage 

2408.6           0.5874                     0.587 
 909.5           0.2218                     0.809 
 318.5           0.0777                     0.887 
 195.9           0.0478                     0.935 
 149.3           0.0364                     0.971 
  96.1           0.0234                     0.994 
  22.8           0.0056                     1.000 
   0.0           0.0000                     1.000 


Table 13.6. Eigenvalues and percentages of explained variance (excluding 
Corsica). 


Figure 13.3. Correspondence analysis excluding Corsica, baccalauréat data. 
MVAcorrbac.xpl 


baccalauréats A and C. Regarding the regions, Ile de France plays an important role on each 
axis. On the first axis, it is opposed to Lorraine and Alsace, whereas on the second axis, it 
is opposed to Poitou-Charentes and Aquitaine. All of this is confirmed in Figure 13.3. 


On the right side are the more classical baccalauréats and on the left, more technical ones. 
The regions on the left side have thus larger weights in the technical baccalauréats. Note also 
that most of the southern regions of France are concentrated in the lower part of the graph 
near the baccalauréat A. 


Finally, looking at the 3-rd axis, we see that it is dominated by the baccalauréat E (negative 
sign) and to a lesser degree by H (negative) (as opposed to A (positive sign)). The domi- 
nating regions are HNOR (positive sign), opposed to NOPC and AUVE (negative sign). For 
instance, HNOR is particularly poor in baccalauréat D. 


Region   $r_1$       $r_2$       $r_3$       $C_a(i, r_1)$   $C_a(i, r_2)$   $C_a(i, r_3)$ 

ILDF      0.1464    0.0677    0.0157   0.3839        0.2175        0.0333 
CHAM     -0.0603   -0.0410   -0.0187   0.0064        0.0078        0.0047 
PICA      0.0323   -0.0258   -0.0318   0.0021        0.0036        0.0155 
HNOR     -0.0692    0.0287    0.1156   0.0096        0.0044        0.2035 
CENT     -0.0068   -0.0205   -0.0145   0.0001        0.0030        0.0043 
BNOR     -0.0271   -0.0762    0.0061   0.0014        0.0284        0.0005 
BOUR     -0.1921    0.0188    0.0578   0.0920        0.0023        0.0630 
NOPC     -0.1278    0.0863   -0.0570   0.0871        0.1052        0.1311 
LORR     -0.2084    0.0511    0.0467   0.1606        0.0256        0.0608 
ALSA     -0.2331    0.0838    0.0655   0.1283        0.0439        0.0767 
FRAC     -0.1304   -0.0368   -0.0444   0.0265        0.0056        0.0232 
PAYL     -0.0743   -0.0816   -0.0341   0.0232        0.0743        0.0370 
BRET      0.0158    0.0249   -0.0469   0.0011        0.0070        0.0708 
PCHA     -0.0610   -0.1391   -0.0178   0.0085        0.1171        0.0054 
AQUI      0.0368   -0.1183    0.0455   0.0055        0.1519        0.0643 
MIDI      0.0208   -0.0567    0.0138   0.0018        0.0359        0.0061 
LIMO     -0.0540    0.0221   -0.0427   0.0033        0.0014        0.0154 
RHOA     -0.0225    0.0273   -0.0385   0.0042        0.0161        0.0918 
AUVE      0.0290   -0.0139   -0.0554   0.0017        0.0010        0.0469 
LARO      0.0290   -0.0862   -0.0177   0.0383        0.0595        0.0072 
PROV      0.0469   -0.0717    0.0279   0.0142        0.0884        0.0383 


Table 13.7. Coefficients and absolute contributions for regions, Example 13.4. 


Baccal   $s_1$       $s_2$       $s_3$       $C_a(j, s_1)$   $C_a(j, s_2)$   $C_a(j, s_3)$ 

A         0.0447   -0.0679    0.0367   0.0376        0.2292        0.1916 
B         0.1389    0.0557    0.0011   0.1724        0.0735        0.0001 
C         0.0940    0.0995    0.0079   0.1198        0.3556        0.0064 
D         0.0227   -0.0495   -0.0530   0.0098        0.1237        0.4040 
E        -0.1932    0.0492   -0.1317   0.0825        0.0141        0.2900 
F        -0.2156    0.0862    0.0188   0.3793        0.1608        0.0219 
G        -0.1244   -0.0353    0.0279   0.1969        0.0421        0.0749 
H        -0.0945    0.0438   -0.0888   0.0017        0.0010        0.0112 


Table 13.8. Coefficients and absolute contributions for baccalauréats, Example 13.4. 


EXAMPLE 13.5 The U.S. crime data set (Table B.10) gives the number of crimes in the 
50 states of the U.S. classified in 1985 for each of the following seven categories: murder, 


$\lambda_j$      percentage of variance    cumulated percentage 

4399.0    0.4914                    0.4914 
2213.6    0.2473                    0.7387 
1382.4    0.1544                    0.8932 
 870.7    0.0973                    0.9904 
  51.0    0.0057                    0.9961 
  34.8    0.0039                    1.0000 
   0.0    0.0000                    1.0000 


Table 13.9. Eigenvalues and explained proportion of variance, Example 13.5. 

Figure 13.4. Projection of rows (the 50 states) and columns (the 7 crime 
categories), uscrime data. MVAcorrcrime.xpl 


rape, robbery, assault, burglary, larceny and auto-theft. The analysis of the contingency table, 
limited to the first two factors, provides the following results (see Table 13.9). 


Looking at the absolute contributions (not reproduced here, see Exercise 13.6), it appears that 
the first axis is a robbery (+) versus larceny (-) and auto-theft (-) axis and that the second 
factor contrasts assault (-) to auto-theft (+). The dominating states for the first axis are the 
North-Eastern States MA (+) and NY (+) contrasting the Western States WY (-) and ID 
(-). For the second axis, the differences are seen between the Northern States (MA (+) and 
RI (+)) and the Southern States AL (-), MS (-) and AR (-). These results can be clearly 
seen in Figure 13.4 where all the states and crimes are reported. The figure also shows in 
which states the proportion of a particular crime category is higher or lower than the national 
average (the origin). 


Biplots 


The biplot is a low-dimensional display of a data matrix $\mathcal{X}$ where the rows and columns 
are represented by points. The interpretation of a biplot is specifically directed towards the 
scalar products of lower dimensional factorial variables and is designed to approximately 
recover the individual elements of the data matrix in these scalar products. Suppose that 
we have a $(10 \times 5)$ data matrix with elements $x_{ij}$. The idea of the biplot is to find 10 row 
points $g_i \in \mathbb{R}^k$ ($k < p$, $i = 1, \dots, 10$) and 5 column points $t_j \in \mathbb{R}^k$ ($j = 1, \dots, 5$) such that 
the 50 scalar products between the row and the column vectors closely approximate the 50 
corresponding elements of the data matrix $\mathcal{X}$. Usually we choose $k = 2$. For example, the 
scalar product between $g_7$ and $t_4$ should approximate the data value $x_{74}$ in the seventh row 
and the fourth column. In general, the biplot models the data $x_{ij}$ as the sum of a scalar 
product in some low-dimensional subspace and a residual “error” term: 

$$ x_{ij} = g_i^\top t_j + e_{ij} = \sum_{k} g_{ik} t_{jk} + e_{ij}. \tag{13.25} $$


To understand the link between correspondence analysis and the biplot, we need to introduce 
a formula which expresses $x_{ij}$ from the original data matrix (see (13.3)) in terms of row and 
column frequencies. One such formula, known as the “reconstitution formula”, is (13.10): 

$$ x_{ij} = E_{ij} \left( 1 + \frac{\sum_{k=1}^{R} \lambda_k^{1/2} \gamma_{ik} \delta_{jk}}{\sqrt{x_{i\bullet}\, x_{\bullet j} / x_{\bullet\bullet}}} \right). \tag{13.26} $$

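Consider now the row profiles $x_{ij}/x_{i\bullet}$ (the conditional frequencies) and the average row profile 
$x_{\bullet j}/x_{\bullet\bullet}$. From (13.26) we obtain the difference between each row profile and this average: 

$$ \left( \frac{x_{ij}}{x_{i\bullet}} - \frac{x_{\bullet j}}{x_{\bullet\bullet}} \right) = \sum_{k=1}^{R} \lambda_k^{1/2} \gamma_{ik} \sqrt{\frac{x_{\bullet j}}{x_{i\bullet}\, x_{\bullet\bullet}}}\; \delta_{jk}. \tag{13.27} $$

By the same argument we can also obtain the difference between each column profile 
$x_{ij}/x_{\bullet j}$ and the average column profile $x_{i\bullet}/x_{\bullet\bullet}$: 

$$ \left( \frac{x_{ij}}{x_{\bullet j}} - \frac{x_{i\bullet}}{x_{\bullet\bullet}} \right) = \sum_{k=1}^{R} \lambda_k^{1/2} \gamma_{ik} \sqrt{\frac{x_{i\bullet}}{x_{\bullet j}\, x_{\bullet\bullet}}}\; \delta_{jk}. \tag{13.28} $$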

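Now, if $\lambda_1 \gg \lambda_2 \gg \lambda_3 \dots$, we can approximate these sums by a finite number of K terms 
(usually $K = 2$) using (13.16) to obtain 

$$ \left( \frac{x_{ij}}{x_{i\bullet}} - \frac{x_{\bullet j}}{x_{\bullet\bullet}} \right) = \sum_{k=1}^{K} r_{ki} \left( \frac{x_{\bullet j}}{\sqrt{\lambda_k x_{\bullet\bullet}}} \right) s_{kj} + e_{ij}, \tag{13.29} $$

$$ \left( \frac{x_{ij}}{x_{\bullet j}} - \frac{x_{i\bullet}}{x_{\bullet\bullet}} \right) = \sum_{k=1}^{K} s_{kj} \left( \frac{x_{i\bullet}}{\sqrt{\lambda_k x_{\bullet\bullet}}} \right) r_{ki} + e'_{ij}, \tag{13.30} $$

where $e_{ij}$ and $e'_{ij}$ are error terms. (13.29) shows that if we consider displaying the differences 
between the row profiles and the average profile, then the projections of the row profiles $r_k$ 
and a rescaled version of the projections of the column profiles $s_k$ constitute a biplot of these 
differences. (13.30) implies the same for the differences between the column profiles and this 
average. 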
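

The biplot reading of the row-profile differences can be illustrated on the industrial location table. The sketch below assumes NumPy, takes the average row profile to be $x_{\bullet j}/x_{\bullet\bullet}$, and uses a hypothetical helper `biplot_fit(K)` that returns the scalar products built from the first K factors. With all R factors the differences are reproduced exactly, and with fewer factors an error term remains.

```python
import numpy as np

# Industrial location table of Example 13.2.
X = np.array([[4, 0, 2],
              [0, 1, 1],
              [1, 1, 4]], dtype=float)

a, b, x_tot = X.sum(axis=1), X.sum(axis=0), X.sum()
E = np.outer(a, b) / x_tot
C = (X - E) / np.sqrt(E)

Gamma, sing, Delta_t = np.linalg.svd(C, full_matrices=False)
R = int((sing > 1e-10).sum())                    # numerical rank of C
lam = sing[:R] ** 2
r = np.sqrt(lam) * Gamma[:, :R] / np.sqrt(a)[:, None]      # row factors
s = np.sqrt(lam) * Delta_t.T[:, :R] / np.sqrt(b)[:, None]  # column factors

# Row profiles minus the average row profile (an n x p array).
profile_diff = X / a[:, None] - b / x_tot


def biplot_fit(K):
    """Scalar products g_i^T t_j based on the first K factors."""
    g = r[:, :K]                                             # row points
    t_pts = s[:, :K] * b[:, None] / np.sqrt(lam[:K] * x_tot)  # rescaled column points
    return g @ t_pts.T


err_one = np.abs(profile_diff - biplot_fit(1)).max()
err_all = np.abs(profile_diff - biplot_fit(R)).max()
print("max error with K = 1:", err_one)
print("max error with K = R:", err_all)
```

The row points are the factor coordinates themselves, whereas the column points are rescaled by $x_{\bullet j}/\sqrt{\lambda_k x_{\bullet\bullet}}$. This rescaling is what distinguishes the biplot from the usual joint display of $r_k$ and $s_k$.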


Summary 


↪ Correspondence analysis is a factorial decomposition of contingency tables. 
The p-dimensional individuals and the n-dimensional variables can 
be graphically represented by projecting onto spaces of smaller dimension. 


↪ The practical computation consists of first computing a spectral decomposition 
of $A^{-1} \mathcal{X} B^{-1} \mathcal{X}^\top$ and $B^{-1} \mathcal{X}^\top A^{-1} \mathcal{X}$, which have the same first p 
eigenvalues. The graphical representation is obtained by plotting $\sqrt{\lambda_1}\, r_1$ 
vs. $\sqrt{\lambda_2}\, r_2$ and $\sqrt{\lambda_1}\, s_1$ vs. $\sqrt{\lambda_2}\, s_2$. Both plots may be displayed in the same 
graph taking into account the appropriate orientation of the eigenvectors 
$r_i$, $s_j$. 


↪ Correspondence analysis provides a graphical display of the association 
measure $c_{ij}^2 = (x_{ij} - E_{ij})^2 / E_{ij}$. 


↪ The biplot is a low-dimensional display of a data matrix where the rows and 
columns are represented by points. 


13.4 Exercises 


EXERCISE 13.1 Show that the matrices $A^{-1} \mathcal{X} B^{-1} \mathcal{X}^\top$ and $B^{-1} \mathcal{X}^\top A^{-1} \mathcal{X}$ have an eigenvalue 
equal to 1 and that the corresponding eigenvectors are proportional to $(1, \dots, 1)^\top$. 


EXERCISE 13.2 Verify the relations in (13.8), (13.14) and (13.17). 


EXERCISE 13.3 Do a correspondence analysis for the car marks data (Table B.7)! Explain 
how this table can be considered as a contingency table. 


EXERCISE 13.4 Compute the $\chi^2$-statistic of independence for the French baccalauréat data. 


EXERCISE 13.5 Prove that $C = A^{-1/2} (\mathcal{X} - E) B^{-1/2} \sqrt{x_{\bullet\bullet}}$ and $E = a b^\top x_{\bullet\bullet}^{-1}$, and verify (13.20). 


EXERCISE 13.6 Do the full correspondence analysis of the U.S. crime data (Table B.10), 
and determine the absolute contributions for the first three axes. How can you interpret the 
third axis? Try to identify each state with the one of the four regions to which it belongs. Do 
you think the four regions have a different behavior with respect to crime? 


EXERCISE 13.7 Repeat Exercise 13.6 with the U.S. health data (Table B.16). Only analyze 
the columns indicating the number of deaths per state. 


EXERCISE 13.8 Consider an $(n \times n)$ contingency table being a diagonal matrix $\mathcal{X}$. What do 
you expect the factors $r_k$, $s_k$ to be like? 


EXERCISE 13.9 Assume that after some reordering of the rows and the columns, the con- 
tingency table has the following structure: 


$$
\mathcal{X} = \begin{array}{c|cc}
      & J_1 & J_2 \\ \hline
  I_1 & *   & 0   \\
  I_2 & 0   & *
\end{array}
$$


That is, the rows I; only have weights in the columns J;, fori = 1,2. What do you expect 
the graph of the first two factors to look like? 


EXERCISE 13.10 Redo Exercise 13.9 using the following contingency table: 


$$
\mathcal{X} = \begin{array}{c|ccc}
      & J_1 & J_2 & J_3 \\ \hline
  I_1 & *   & 0   & 0   \\
  I_2 & 0   & *   & 0   \\
  I_3 & 0   & 0   & *
\end{array}
$$


EXERCISE 13.11 Consider the French food data (Table B.6). Given that all of the vari- 
ables are measured in the same units (Francs), explain how this table can be considered as 
a contingency table. Perform a correspondence analysis and compare the results to those 
obtained in the NPCA analysis in Chapter 9. 


14 Canonical Correlation Analysis 


Complex multivariate data structures are better understood by studying low-dimensional 
projections. For a joint study of two data sets, we may ask what type of low-dimensional 
projection helps in finding possible joint structures for the two samples. The canonical 
correlation analysis is a standard tool of multivariate statistical analysis for discovery and 
quantification of associations between two sets of variables. 


The basic technique is based on projections. For each of the two samples separately, one
defines an index (a projected multivariate variable) that correlates maximally with the index
of the other sample. The aim of canonical correlation analysis is to maximize the association
(measured by correlation) between the low-dimensional projections of the two data sets. The
canonical correlation vectors are found by a joint covariance analysis of the two variables.
The technique is applied to a marketing example where the association of a price factor and
other variables (like design, sportiness etc.) is analyzed. Tests are given on how to evaluate
the significance of the discovered association.


14.1 Most Interesting Linear Combination 


The associations between two sets of variables may be identified and quantified by canonical 
correlation analysis. The technique was originally developed by Hotelling (1935) who ana- 
lyzed how arithmetic speed and arithmetic power are related to reading speed and reading 
power. Other examples are the relation between governmental policy variables and economic 
performance variables and the relation between job and company characteristics. 


Suppose we are given two random variables $X \in \mathbb{R}^q$ and $Y \in \mathbb{R}^p$. The idea is to find an
index describing a (possible) link between $X$ and $Y$. Canonical correlation analysis (CCA)
is based on linear indices, i.e., linear combinations

$$a^\top X \quad \text{and} \quad b^\top Y$$

of the random variables. Canonical correlation analysis searches for vectors $a$ and $b$ such
that the relation of the two indices $a^\top x$ and $b^\top y$ is quantified in some interpretable way.
More precisely, one is looking for the "most interesting" projections $a$ and $b$ in the sense that
they maximize the correlation

$$\rho(a,b) = \rho_{a^\top X \, b^\top Y} \qquad (14.1)$$

between the two indices.


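Let us consider the correlation $\rho(a,b)$ between the two projections in more detail. Suppose
that

$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim \left( \begin{pmatrix} \mu \\ \nu \end{pmatrix}, \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix} \right)$$

where the sub-matrices of this covariance structure are given by

$$\begin{aligned}
Var(X) &= \Sigma_{XX} && (q \times q) \\
Var(Y) &= \Sigma_{YY} && (p \times p) \\
Cov(X,Y) &= E(X-\mu)(Y-\nu)^\top = \Sigma_{XY} = \Sigma_{YX}^\top && (q \times p).
\end{aligned}$$

Using (3.7) and (4.26),

$$\rho(a,b) = \frac{a^\top \Sigma_{XY} b}{(a^\top \Sigma_{XX} a)^{1/2} (b^\top \Sigma_{YY} b)^{1/2}}. \qquad (14.2)$$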


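Therefore, $\rho(ca,b) = \rho(a,b)$ for any $c \in \mathbb{R}^+$. Given the invariance of scale we may rescale
projections $a$ and $b$ and thus we can equally solve

$$\max_{a,b} \; a^\top \Sigma_{XY} b$$

under the constraints

$$a^\top \Sigma_{XX} a = 1, \qquad b^\top \Sigma_{YY} b = 1.$$

For this problem, define

$$\mathcal{K} = \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2}. \qquad (14.3)$$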


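Recall the singular value decomposition of $\mathcal{K}(q \times p)$ from Theorem 2.2. The matrix $\mathcal{K}$ may
be decomposed as

$$\mathcal{K} = \Gamma \Lambda \Delta^\top$$

with

$$\begin{aligned}
\Gamma &= (\gamma_1, \ldots, \gamma_k) \\
\Delta &= (\delta_1, \ldots, \delta_k) \\
\Lambda &= \mathrm{diag}(\lambda_1^{1/2}, \ldots, \lambda_k^{1/2})
\end{aligned} \qquad (14.4)$$

where by (14.3) and (2.15),

$$k = \mathrm{rank}(\mathcal{K}) = \mathrm{rank}(\Sigma_{XY}) = \mathrm{rank}(\Sigma_{YX}),$$

and $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_k$ are the nonzero eigenvalues of $\mathcal{N}_1 = \mathcal{K}\mathcal{K}^\top$ and $\mathcal{N}_2 = \mathcal{K}^\top\mathcal{K}$, and $\gamma_i$ and
$\delta_i$ are the standardized eigenvectors of $\mathcal{N}_1$ and $\mathcal{N}_2$ respectively.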


Define now for $i = 1,\ldots,k$ the vectors

$$a_i = \Sigma_{XX}^{-1/2} \gamma_i, \qquad (14.5)$$
$$b_i = \Sigma_{YY}^{-1/2} \delta_i, \qquad (14.6)$$

which are called the canonical correlation vectors. Using these canonical correlation vectors
we define the canonical correlation variables

$$\eta_i = a_i^\top X, \qquad (14.7)$$
$$\varphi_i = b_i^\top Y. \qquad (14.8)$$

The quantities $\rho_i = \lambda_i^{1/2}$ for $i = 1,\ldots,k$ are called the canonical correlation coefficients.
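The construction in (14.3) to (14.8) can be illustrated numerically. The following NumPy sketch is only an illustration under stated assumptions: the function names `inv_sqrt` and `cca_from_cov` are hypothetical, the data are simulated, and empirical covariance blocks stand in for $\Sigma_{XX}$, $\Sigma_{XY}$, $\Sigma_{YY}$.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    eigval, eigvec = np.linalg.eigh(S)
    return eigvec @ np.diag(eigval ** -0.5) @ eigvec.T

def cca_from_cov(Sxx, Sxy, Syy):
    """Canonical correlation vectors (columns of A, B) and coefficients rho."""
    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    K = Sxx_is @ Sxy @ Syy_is                           # matrix K of (14.3)
    G, sv, Dt = np.linalg.svd(K, full_matrices=False)   # K = G diag(sv) Dt
    k = np.linalg.matrix_rank(K)
    A = Sxx_is @ G[:, :k]                               # a_i as in (14.5)
    B = Syy_is @ Dt.T[:, :k]                            # b_i as in (14.6)
    return A, B, sv[:k]                                 # sv = sqrt(lambda_i)

# Simulated toy data (an assumption for illustration): q = 2, p = 3
rng = np.random.default_rng(0)
n, q, p = 500, 2, 3
X = rng.normal(size=(n, q))
Y = X @ rng.normal(size=(q, p)) + rng.normal(size=(n, p))

S = np.cov(np.hstack([X, Y]), rowvar=False)
Sxx, Sxy, Syy = S[:q, :q], S[:q, q:], S[q:, q:]
A, B, rho = cca_from_cov(Sxx, Sxy, Syy)

eta, phi = X @ A, Y @ B     # canonical correlation variables (14.7), (14.8)
print(rho)
```

In this sketch the empirical correlation between the first columns of `eta` and `phi` coincides with `rho[0]`, and `A.T @ Sxx @ A` is the identity matrix, which mirrors the constraints of the maximization problem.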


From the properties of the singular value decomposition given in (14.4) we have

$$Cov(\eta_i, \eta_j) = a_i^\top \Sigma_{XX} a_j = \gamma_i^\top \gamma_j = \begin{cases} 1 & i = j, \\ 0 & i \ne j. \end{cases} \qquad (14.9)$$

The same is true for $Cov(\varphi_i, \varphi_j)$. The following theorem tells us that the canonical correlation
vectors are the solution to the maximization problem of (14.1).


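THEOREM 14.1 For any given $r$, $1 \le r \le k$, the maximum

$$C(r) = \max_{a,b} \; a^\top \Sigma_{XY} b \qquad (14.10)$$

subject to

$$a^\top \Sigma_{XX} a = 1, \qquad b^\top \Sigma_{YY} b = 1$$

and

$$a_i^\top \Sigma_{XX} a = 0 \quad \text{for } i = 1, \ldots, r-1$$

is given by

$$C(r) = \rho_r = \sqrt{\lambda_r}$$

and is attained when $a = a_r$ and $b = b_r$.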


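Proof:
The proof is given in three steps.


(i) Fix $a$ and maximize over $b$, i.e., solve:

$$\max_b \, (a^\top \Sigma_{XY} b)^2 = \max_b \, (b^\top \Sigma_{YX} a)(a^\top \Sigma_{XY} b)$$

subject to $b^\top \Sigma_{YY} b = 1$. By Theorem 2.5 the maximum is given by the largest eigenvalue of
the matrix

$$\Sigma_{YY}^{-1} \Sigma_{YX} a a^\top \Sigma_{XY}.$$

By Corollary 2.2, the only nonzero eigenvalue equals

$$a^\top \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX} a. \qquad (14.11)$$


(ii) Maximize (14.11) over $a$ subject to the constraints of the Theorem. Put $\gamma = \Sigma_{XX}^{1/2} a$ and
observe that (14.11) equals

$$\gamma^\top \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX} \Sigma_{XX}^{-1/2} \gamma = \gamma^\top \mathcal{K}\mathcal{K}^\top \gamma.$$

Thus, solve the equivalent problem

$$\max_\gamma \; \gamma^\top \mathcal{N}_1 \gamma \qquad (14.12)$$

subject to $\gamma^\top \gamma = 1$, $\gamma_i^\top \gamma = 0$ for $i = 1, \ldots, r-1$.


Note that the $\gamma_i$'s are the eigenvectors of $\mathcal{N}_1$ corresponding to its first $r-1$ largest eigenvalues.
Thus, as in Theorem 9.3, the maximum in (14.12) is obtained by setting $\gamma$ equal to the
eigenvector corresponding to the $r$-th largest eigenvalue, i.e., $\gamma = \gamma_r$ or equivalently $a = a_r$.
This yields

$$C^2(r) = \gamma_r^\top \mathcal{N}_1 \gamma_r = \lambda_r \gamma_r^\top \gamma_r = \lambda_r.$$


(iii) Show that the maximum is attained for $a = a_r$ and $b = b_r$. From the SVD of $\mathcal{K}$ we
conclude that $\mathcal{K}\delta_r = \rho_r \gamma_r$ and hence

$$a_r^\top \Sigma_{XY} b_r = \gamma_r^\top \mathcal{K} \delta_r = \rho_r \gamma_r^\top \gamma_r = \rho_r.$$


In particular, for $r = 1$ the canonical correlation vectors

$$a_1 = \Sigma_{XX}^{-1/2} \gamma_1, \qquad b_1 = \Sigma_{YY}^{-1/2} \delta_1$$

maximize the correlation between the canonical variables

$$\eta_1 = a_1^\top X, \qquad \varphi_1 = b_1^\top Y.$$


The covariance of the canonical variables $\eta$ and $\varphi$ is given in the next theorem.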
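THEOREM 14.2 Let $\eta_i$ and $\varphi_i$ be the $i$-th canonical correlation variables ($i = 1,\ldots,k$).
Define $\eta = (\eta_1, \ldots, \eta_k)$ and $\varphi = (\varphi_1, \ldots, \varphi_k)$. Then

$$Var\begin{pmatrix} \eta \\ \varphi \end{pmatrix} = \begin{pmatrix} \mathcal{I}_k & \Lambda \\ \Lambda & \mathcal{I}_k \end{pmatrix}$$

with $\Lambda$ given in (14.4).


This theorem shows that the canonical correlation coefficients, $\rho_i = \lambda_i^{1/2}$, are the covariances
between the canonical variables $\eta_i$ and $\varphi_i$ and that the indices $\eta_1 = a_1^\top X$ and $\varphi_1 = b_1^\top Y$ have
the maximum covariance $\sqrt{\lambda_1} = \rho_1$.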


The following theorem shows that canonical correlations are invariant w.r.t. linear transfor- 
mations of the original variables. 


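THEOREM 14.3 Let $X^* = \mathcal{U}^\top X + u$ and $Y^* = \mathcal{V}^\top Y + v$ where $\mathcal{U}$ and $\mathcal{V}$ are nonsingular
matrices. Then the canonical correlations between $X^*$ and $Y^*$ are the same as those between
$X$ and $Y$. The canonical correlation vectors of $X^*$ and $Y^*$ are given by

$$a_i^* = \mathcal{U}^{-1} a_i,$$
$$b_i^* = \mathcal{V}^{-1} b_i. \qquad (14.13)$$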
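Theorem 14.3 can be checked numerically. The sketch below is an illustration only: the helper names, the simulated covariance matrix and the particular nonsingular matrices `U` and `V` are assumptions chosen for the example. Since the shifts $u$ and $v$ do not affect covariances, only the linear parts enter the computation.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    w, vec = np.linalg.eigh(S)
    return vec @ np.diag(w ** -0.5) @ vec.T

def cca(Sxx, Sxy, Syy):
    """Canonical correlation vectors (columns) and coefficients via the SVD of K."""
    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    G, sv, Dt = np.linalg.svd(Sxx_is @ Sxy @ Syy_is, full_matrices=False)
    return Sxx_is @ G, Syy_is @ Dt.T, sv

# A simulated positive definite covariance matrix with q = 2 and p = 3
rng = np.random.default_rng(1)
q, p = 2, 3
M = rng.normal(size=(q + p, q + p))
S = M @ M.T + 5.0 * np.eye(q + p)
Sxx, Sxy, Syy = S[:q, :q], S[:q, q:], S[q:, q:]

# Nonsingular transformation matrices (arbitrary choices for illustration)
U = np.array([[2.0, 1.0], [0.0, 1.0]])
V = np.array([[1.0, 0.0, 1.0], [0.0, 2.0, 0.0], [1.0, 0.0, 3.0]])

A, B, rho = cca(Sxx, Sxy, Syy)
# Covariance blocks of X* = U^T X + u and Y* = V^T Y + v
A_star, B_star, rho_star = cca(U.T @ Sxx @ U, U.T @ Sxy @ V, V.T @ Syy @ V)

print(rho, rho_star)
```

The canonical correlation vectors are determined only up to sign, so the comparison with $\mathcal{U}^{-1}a_i$ and $\mathcal{V}^{-1}b_i$ has to be made on absolute values or after aligning signs.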


Summary


↪ Canonical correlation analysis aims to identify possible links between two
(sub-)sets of variables $X \in \mathbb{R}^q$ and $Y \in \mathbb{R}^p$. The idea is to find indices
$a^\top X$ and $b^\top Y$ such that the correlation $\rho(a,b) = \rho_{a^\top X \, b^\top Y}$ is maximal.


↪ The maximum correlation (under constraints) is attained by setting $a_i = \Sigma_{XX}^{-1/2}\gamma_i$
and $b_i = \Sigma_{YY}^{-1/2}\delta_i$, where $\gamma_i$ and $\delta_i$ denote the eigenvectors of $\mathcal{K}\mathcal{K}^\top$
and $\mathcal{K}^\top\mathcal{K}$, $\mathcal{K} = \Sigma_{XX}^{-1/2}\Sigma_{XY}\Sigma_{YY}^{-1/2}$, respectively.


↪ The vectors $a_i$ and $b_i$ are called canonical correlation vectors.


↪ The indices $\eta_i = a_i^\top X$ and $\varphi_i = b_i^\top Y$ are called canonical correlation
variables.


↪ The values $\rho_1 = \sqrt{\lambda_1}, \ldots, \rho_k = \sqrt{\lambda_k}$, which are the square roots of the
nonzero eigenvalues of $\mathcal{K}\mathcal{K}^\top$ and $\mathcal{K}^\top\mathcal{K}$, are called the canonical correlation
coefficients. The covariance between the canonical correlation variables is
$Cov(\eta_i, \varphi_i) = \sqrt{\lambda_i}$, $i = 1, \ldots, k$.


↪ The first canonical variables, $\eta_1 = a_1^\top X$ and $\varphi_1 = b_1^\top Y$, have the maximum
covariance $\sqrt{\lambda_1}$.


↪ Canonical correlations are invariant w.r.t. linear transformations of the
original variables $X$ and $Y$.


14.2 Canonical Correlation in Practice 


In practice we have to estimate the covariance matrices $\Sigma_{XX}$, $\Sigma_{XY}$ and $\Sigma_{YY}$. Let us apply
the canonical correlation analysis to the car marks data (see Table B.7). In the context of
this data set one is interested in relating price variables with variables such as sportiness,
safety, etc. In particular, we would like to investigate the relation between the two variables
non-depreciation of value and price of the car and all other variables.


EXAMPLE 14.1 We perform the canonical correlation analysis on the data matrices $\mathcal{X}$
and $\mathcal{Y}$ that correspond to the set of values {Price, Value Stability} and {Economy, Service,
Design, Sporty car, Safety, Easy handling}, respectively. The estimated covariance matrix
$\mathcal{S}$, with rows and columns ordered as Price, Value, Econ., Serv., Design, Sport., Safety,
Easy h., is given by

$$\mathcal{S} = \left(\begin{array}{rr|rrrrrr}
1.41 & -1.11 & 0.78 & -0.71 & -0.90 & -1.04 & -0.95 & 0.18 \\
-1.11 & 1.19 & -0.42 & 0.82 & 0.77 & 0.90 & 1.12 & 0.11 \\ \hline
0.78 & -0.42 & 0.75 & -0.23 & -0.45 & -0.42 & -0.28 & 0.28 \\
-0.71 & 0.82 & -0.23 & 0.66 & 0.52 & 0.57 & 0.85 & 0.14 \\
-0.90 & 0.77 & -0.45 & 0.52 & 0.72 & 0.77 & 0.68 & -0.10 \\
-1.04 & 0.90 & -0.42 & 0.57 & 0.77 & 1.05 & 0.76 & -0.15 \\
-0.95 & 1.12 & -0.28 & 0.85 & 0.68 & 0.76 & 1.26 & 0.22 \\
0.18 & 0.11 & 0.28 & 0.14 & -0.10 & -0.15 & 0.22 & 0.32
\end{array}\right).$$

Hence,

$$\mathcal{S}_{XX} = \begin{pmatrix} 1.41 & -1.11 \\ -1.11 & 1.19 \end{pmatrix}, \qquad
\mathcal{S}_{XY} = \begin{pmatrix} 0.78 & -0.71 & -0.90 & -1.04 & -0.95 & 0.18 \\ -0.42 & 0.82 & 0.77 & 0.90 & 1.12 & 0.11 \end{pmatrix},$$

$$\mathcal{S}_{YY} = \begin{pmatrix}
0.75 & -0.23 & -0.45 & -0.42 & -0.28 & 0.28 \\
-0.23 & 0.66 & 0.52 & 0.57 & 0.85 & 0.14 \\
-0.45 & 0.52 & 0.72 & 0.77 & 0.68 & -0.10 \\
-0.42 & 0.57 & 0.77 & 1.05 & 0.76 & -0.15 \\
-0.28 & 0.85 & 0.68 & 0.76 & 1.26 & 0.22 \\
0.28 & 0.14 & -0.10 & -0.15 & 0.22 & 0.32
\end{pmatrix}.$$

It is interesting to see that value stability and price have a negative covariance. This makes
sense since highly priced vehicles tend to lose their market value at a faster pace than
medium priced vehicles.


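Now we estimate $\mathcal{K} = \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2}$ by

$$\hat{\mathcal{K}} = \mathcal{S}_{XX}^{-1/2} \mathcal{S}_{XY} \mathcal{S}_{YY}^{-1/2}$$

and perform a singular value decomposition of $\hat{\mathcal{K}}$:

$$\hat{\mathcal{K}} = \mathcal{G}\mathcal{L}\mathcal{D}^\top = (g_1, g_2)\, \mathrm{diag}(\ell_1^{1/2}, \ell_2^{1/2})\, (d_1, d_2)^\top$$

where the $\ell_i$'s are the eigenvalues of $\hat{\mathcal{K}}\hat{\mathcal{K}}^\top$ and $\hat{\mathcal{K}}^\top\hat{\mathcal{K}}$ with $\mathrm{rank}(\hat{\mathcal{K}}) = 2$, and $g_i$ and $d_i$ are the
eigenvectors of $\hat{\mathcal{K}}\hat{\mathcal{K}}^\top$ and $\hat{\mathcal{K}}^\top\hat{\mathcal{K}}$, respectively. The canonical correlation coefficients are

$$r_1 = \ell_1^{1/2} = 0.98, \qquad r_2 = \ell_2^{1/2} = 0.89.$$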


The high correlation of the first pair of canonical variables can be seen in Figure 14.1. The first
canonical variables are

$$\hat\eta_1 = \hat a_1^\top x = 1.602\, x_1 + 1.686\, x_2$$
$$\hat\varphi_1 = \hat b_1^\top y = 0.568\, y_1 + 0.544\, y_2 - 0.012\, y_3 - 0.096\, y_4 - 0.014\, y_5 + 0.915\, y_6.$$

Note that the variables $y_1$ (economy), $y_2$ (service) and $y_6$ (easy handling) have positive
coefficients on $\hat\varphi_1$. The variables $y_3$ (design), $y_4$ (sporty car) and $y_5$ (safety) have a negative
influence on $\hat\varphi_1$.


The canonical variable $\eta_1$ may be interpreted as a price and value index. The canonical
variable $\varphi_1$ is mainly formed from the qualitative variables economy, service and handling
with negative weights on design, safety and sportiness. These variables may therefore be
interpreted as an appreciation of the value of the car. The sportiness has a negative effect
on the price and value index, as do the design and the safety features.


Testing the canonical correlation coefficients


The hypothesis that the two sets of variables $\mathcal{X}$ and $\mathcal{Y}$ are uncorrelated may be tested (under
normality assumptions) with Wilks' likelihood ratio statistic (Gibbins, 1985):

$$T^{2/n} = \left| \mathcal{I} - \mathcal{S}_{YY}^{-1} \mathcal{S}_{YX} \mathcal{S}_{XX}^{-1} \mathcal{S}_{XY} \right| = \prod_{i=1}^{k} (1 - \ell_i).$$

This statistic unfortunately has a rather complicated distribution. Bartlett (1939) provides
an approximation for large $n$:

$$-\{n - (p+q+3)/2\} \log \prod_{i=1}^{k} (1 - \ell_i) \sim \chi^2_{pq}. \qquad (14.14)$$

Figure 14.1. The first canonical variables for the car marks data.
MVAcancarm.xpl


A test of the hypothesis that only $s$ of the canonical correlation coefficients are non-zero may
be based (asymptotically) on the statistic

$$-\{n - (p+q+3)/2\} \log \prod_{i=s+1}^{k} (1 - \ell_i) \sim \chi^2_{(p-s)(q-s)}. \qquad (14.15)$$


EXAMPLE 14.2 Consider Example 14.1 again. There are $n = 40$ persons that have rated
the cars according to different categories with $p = 2$ and $q = 6$. The canonical correlation
coefficients were found to be $r_1 = 0.98$ and $r_2 = 0.89$. Bartlett's statistic (14.14) is therefore

$$-\{40 - (2+6+3)/2\} \log\{(1 - 0.98^2)(1 - 0.89^2)\} = 165.59 \sim \chi^2_{12},$$

which is highly significant (the 99% quantile of the $\chi^2_{12}$ is 26.23). The hypothesis of no
correlation between the variables $X$ and $Y$ is therefore rejected.


Let us now test whether the second canonical correlation coefficient is different from zero.
We use Bartlett's statistic (14.15) with $s = 1$ and obtain

$$-\{40 - (2+6+3)/2\} \log\{(1 - 0.89^2)\} = 54.19 \sim \chi^2_{5},$$

which is again highly significant with the $\chi^2_5$ distribution.
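The two test statistics of Example 14.2 can be recomputed with a few lines of Python. This is a minimal sketch using only the standard library; the function name `bartlett_statistic` is a hypothetical choice, and the eigenvalues $\ell_i$ are taken as the squared canonical correlation coefficients $r_i^2$.

```python
import math

def bartlett_statistic(r, n, p, q, s=0):
    """Bartlett's statistic (14.14) for s = 0 and (14.15) for s >= 1.

    r holds the canonical correlation coefficients in decreasing order;
    the returned pair is (statistic, degrees of freedom).
    """
    log_prod = sum(math.log(1.0 - ri ** 2) for ri in r[s:])
    stat = -(n - (p + q + 3) / 2.0) * log_prod
    df = (p - s) * (q - s)
    return stat, df

r = [0.98, 0.89]
stat0, df0 = bartlett_statistic(r, n=40, p=2, q=6)        # no correlation at all
stat1, df1 = bartlett_statistic(r, n=40, p=2, q=6, s=1)   # only one non-zero coefficient
print(round(stat0, 2), df0)
print(round(stat1, 2), df1)
```

Both statistics would then be compared with the corresponding $\chi^2$ quantiles, as in the example.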


Canonical Correlation Analysis with qualitative data 


The canonical correlation technique may also be applied to qualitative data. Consider for
example the contingency table $\mathcal{N}$ of the French baccalauréat data. The dataset is given
in Table B.8 in Appendix B.8. The CCA cannot be applied directly to this contingency
table since the table does not correspond to the usual data matrix structure. We may wish,
however, to explain the relationship between the row $r$ and column $c$ categories. It is possible
to represent the data in a $(n \times (r+c))$ data matrix $\mathcal{Z} = (\mathcal{X}, \mathcal{Y})$ where $n$ is the total number
of frequencies in the contingency table $\mathcal{N}$ and $\mathcal{X}$ and $\mathcal{Y}$ are matrices of zero-one dummy
variables. More precisely, let

$$x_{ki} = \begin{cases} 1 & \text{if the } k\text{-th individual belongs to the } i\text{-th row category} \\ 0 & \text{otherwise} \end{cases}$$

and

$$y_{kj} = \begin{cases} 1 & \text{if the } k\text{-th individual belongs to the } j\text{-th column category} \\ 0 & \text{otherwise} \end{cases}$$

where the indices range from $k = 1,\ldots,n$, $i = 1,\ldots,r$ and $j = 1,\ldots,c$. Denote the cell
frequencies by $n_{ij}$ so that $\mathcal{N} = (n_{ij})$ and note that

$$x_{(i)}^\top y_{(j)} = n_{ij},$$

where $x_{(i)}$ ($y_{(j)}$) denotes the $i$-th ($j$-th) column of $\mathcal{X}$ ($\mathcal{Y}$).


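EXAMPLE 14.3 Consider the following example where

$$\mathcal{N} = \begin{pmatrix} 3 & 2 \\ 1 & 4 \end{pmatrix}.$$

The matrix $\mathcal{X}$ is therefore given (written in transposed form to save space) by

$$\mathcal{X}^\top = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 \end{pmatrix},$$

the matrix $\mathcal{Y}$ by

$$\mathcal{Y}^\top = \begin{pmatrix} 1 & 1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 0 & 1 & 1 & 1 & 1 \end{pmatrix},$$

and the data matrix $\mathcal{Z} = (\mathcal{X}, \mathcal{Y})$ by

$$\mathcal{Z}^\top = \begin{pmatrix}
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 1 & 1 & 1 & 1
\end{pmatrix}.$$

The element $n_{12}$ of $\mathcal{N}$ may be obtained by multiplying the first column of $\mathcal{X}$ with the second
column of $\mathcal{Y}$ to yield

$$n_{12} = x_{(1)}^\top y_{(2)} = 2.$$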


The purpose is to find the canonical variables $\eta = a^\top x$ and $\varphi = b^\top y$ that are maximally
correlated. Note, however, that $x$ has only one non-zero component and therefore an
"individual" may be directly associated with its canonical variables or score $(a_i, b_j)$. There will
be $n_{ij}$ points at each $(a_i, b_j)$ and the correlation represented by these points may serve as a
measure of dependence between the rows and columns of $\mathcal{N}$.
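The zero-one coding of a contingency table, as used in Example 14.3, can be sketched as follows. The function name `dummy_matrices` is a hypothetical choice, and individuals are simply listed cell by cell in row-major order, which is one of many equivalent orderings.

```python
import numpy as np

def dummy_matrices(N):
    """Expand an (r x c) contingency table into zero-one matrices X (n x r) and Y (n x c)."""
    N = np.asarray(N, dtype=int)
    r, c = N.shape
    i_idx, j_idx = np.indices(N.shape)
    # Repeat the row and column label of each cell n_ij times
    rows = np.repeat(i_idx.ravel(), N.ravel())
    cols = np.repeat(j_idx.ravel(), N.ravel())
    X = np.eye(r, dtype=int)[rows]   # x_ki = 1 if individual k is in row category i
    Y = np.eye(c, dtype=int)[cols]   # y_kj = 1 if individual k is in column category j
    return X, Y

N = np.array([[3, 2], [1, 4]])
X, Y = dummy_matrices(N)
Z = np.hstack([X, Y])                # data matrix Z = (X, Y)

print(Z.T)
print(X.T @ Y)                       # reproduces the contingency table N
```

The product `X.T @ Y` collects all inner products $x_{(i)}^\top y_{(j)}$ at once, so it returns the table $\mathcal{N}$ itself.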


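Let $\mathcal{Z} = (\mathcal{X}, \mathcal{Y})$ denote a data matrix constructed from a contingency table $\mathcal{N}$. Similar to
Chapter 12 define

$$c = x_{i\bullet} = \sum_{j=1}^{c} n_{ij},$$
$$d = x_{\bullet j} = \sum_{i=1}^{r} n_{ij},$$

and define $\mathcal{C} = \mathrm{diag}(c)$ and $\mathcal{D} = \mathrm{diag}(d)$. Suppose that $x_{i\bullet} > 0$ and $x_{\bullet j} > 0$ for all $i$ and $j$.
It is not hard to see that

$$n\mathcal{S} = \mathcal{Z}^\top \mathcal{H} \mathcal{Z} = \mathcal{Z}^\top \mathcal{Z} - n \bar z \bar z^\top = \begin{pmatrix} n\mathcal{S}_{XX} & n\mathcal{S}_{XY} \\ n\mathcal{S}_{YX} & n\mathcal{S}_{YY} \end{pmatrix}
= \frac{n}{n-1} \begin{pmatrix} \mathcal{C} - n^{-1} c c^\top & \mathcal{N} - \hat{\mathcal{N}} \\ \mathcal{N}^\top - \hat{\mathcal{N}}^\top & \mathcal{D} - n^{-1} d d^\top \end{pmatrix}$$

where $\hat{\mathcal{N}} = c d^\top / n$ is the estimated value of $\mathcal{N}$ under the assumption of independence of the
row and column categories.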


Note that

$$(n-1)\mathcal{S}_{XX} 1_r = \mathcal{C} 1_r - n^{-1} c c^\top 1_r = c - c(n^{-1} c^\top 1_r) = c - c(n^{-1} n) = 0$$

and therefore $\mathcal{S}_{XX}^{-1}$ does not exist. The same is true for $\mathcal{S}_{YY}^{-1}$. One way out of this difficulty
is to drop one column from both $\mathcal{X}$ and $\mathcal{Y}$, say the first column. Let $\tilde c$ and $\tilde d$ denote the
vectors obtained by deleting the first component of $c$ and $d$.


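Define $\tilde{\mathcal{C}}$, $\tilde{\mathcal{D}}$ and $\tilde{\mathcal{S}}_{XX}$, $\tilde{\mathcal{S}}_{YY}$, $\tilde{\mathcal{S}}_{XY}$ accordingly and obtain

$$(n\tilde{\mathcal{S}}_{XX})^{-1} = \tilde{\mathcal{C}}^{-1} + n_{1\bullet}^{-1} 1_{r-1} 1_{r-1}^\top$$
$$(n\tilde{\mathcal{S}}_{YY})^{-1} = \tilde{\mathcal{D}}^{-1} + n_{\bullet 1}^{-1} 1_{c-1} 1_{c-1}^\top$$

so that (14.3) exists. Here $n_{1\bullet}$ and $n_{\bullet 1}$ are the totals of the dropped first row and first
column category. The score associated with an individual contained in the first row
(column) category of $\mathcal{N}$ is 0.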
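The singularity of the full block and the closed-form inverse of the reduced block can be verified numerically. The sketch below is an illustration under stated assumptions: the $3 \times 3$ table is made up for the example, the function name is hypothetical, and constant scale factors of the covariance estimator are ignored, so the matrices checked are $\mathcal{C} - n^{-1}cc^\top$ and its reduced version $\tilde{\mathcal{C}} - n^{-1}\tilde c \tilde c^\top$.

```python
import numpy as np

def reduced_block_inverse(totals):
    """Closed-form inverse of C~ - c~ c~^T / n after dropping the first category.

    totals holds all marginal totals (first entry is the dropped category);
    their sum is n.
    """
    totals = np.asarray(totals, dtype=float)
    c_tilde = totals[1:]
    m = len(c_tilde)
    return np.diag(1.0 / c_tilde) + np.ones((m, m)) / totals[0]

# Made-up contingency table, used only to illustrate the identity
N = np.array([[3, 2, 1],
              [1, 4, 2],
              [2, 1, 5]])
n = N.sum()
c = N.sum(axis=1).astype(float)              # row totals

full = np.diag(c) - np.outer(c, c) / n       # singular: annihilates the vector of ones
c_tilde = c[1:]
reduced = np.diag(c_tilde) - np.outer(c_tilde, c_tilde) / n

print(full @ np.ones(len(c)))                # numerically zero
print(reduced @ reduced_block_inverse(c))    # numerically the identity
```

The same function applied to the column totals gives the corresponding inverse for the $\mathcal{Y}$ block.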


The technique described here for purely qualitative data may also be used when the data 
is a mixture of qualitative and quantitative characteristics. One has to “blow up” the data 
matrix by dummy zero-one values for the qualitative data variables. 


Summary


↪ In practice we estimate $\Sigma_{XX}$, $\Sigma_{XY}$, $\Sigma_{YY}$ by the empirical covariances and
use them to compute estimates $\ell_i$, $g_i$, $d_i$ for $\lambda_i$, $\gamma_i$, $\delta_i$ from the SVD of
$\hat{\mathcal{K}} = \mathcal{S}_{XX}^{-1/2} \mathcal{S}_{XY} \mathcal{S}_{YY}^{-1/2}$.


↪ The signs of the coefficients of the canonical variables tell us the direction
of the influence of these variables.


14.3 Exercises


EXERCISE 14.1 Show that the eigenvalues of $\mathcal{K}\mathcal{K}^\top$ and $\mathcal{K}^\top\mathcal{K}$ are identical. (Hint: Use
Theorem 2.6)


EXERCISE 14.2 Perform the canonical correlation analysis for the following subsets of
variables: $\mathcal{X}$ corresponding to {price} and $\mathcal{Y}$ corresponding to {economy, easy handling} from
the car marks data (Table B.7).


EXERCISE 14.3 Calculate the second canonical variables for Example 14.1. Interpret the
coefficients.


EXERCISE 14.4 Use the SVD of matrix $\mathcal{K}$ to show that the canonical variables $\eta_1$ and $\eta_2$
are not correlated.


EXERCISE 14.5 Verify that the number of nonzero eigenvalues of matrix $\mathcal{K}$ is equal to
$\mathrm{rank}(\Sigma_{XY})$.


EXERCISE 14.6 Express the singular value decomposition of matrices $\mathcal{K}$ and $\mathcal{K}^\top$ using
eigenvalues and eigenvectors of matrices $\mathcal{K}^\top\mathcal{K}$ and $\mathcal{K}\mathcal{K}^\top$.


EXERCISE 14.7 What will be the result of CCA for $Y = X$?


EXERCISE 14.8 What will be the results of CCA for $Y = 2X$ and for $Y = -X$?


EXERCISE 14.9 What results do you expect if you perform CCA for $X$ and $Y$ such that
$\Sigma_{XY} = 0$? What if $\Sigma_{XY} = \mathcal{I}_p$?


15 Multidimensional Scaling 


One major aim of multivariate data analysis is dimension reduction. For data measured in 
Euclidean coordinates, Factor Analysis and Principal Component Analysis are dominantly 
used tools. In many applied sciences data is recorded as ranked information. For example, 
in marketing, one may record “product A is better than product B”. High-dimensional 
observations therefore often have mixed data characteristics and contain relative information 
(w.r.t. a defined standard) rather than absolute coordinates that would enable us to employ 
one of the multivariate techniques presented so far. 


Multidimensional scaling (MDS) is a method based on proximities between objects, subjects, 
or stimuli used to produce a spatial representation of these items. Proximities express the 
similarity or dissimilarity between data objects. It is a dimension reduction technique since 
the aim is to find a set of points in low dimension (typically 2 dimensions) that reflect the 
relative configuration of the high-dimensional data objects. The metric MDS is concerned 
with such a representation in Euclidean coordinates. The desired projections are found via 
an appropriate spectral decomposition of a distance matrix. 


The metric MDS solution may result in projections of data objects that conflict with the
ranking of the original observations. The nonmetric MDS solves this problem by iterating
between a monotonizing algorithmic step and a least squares projection step. The examples
presented in this chapter are based on reconstructing a map from a distance matrix and on
marketing concerns such as ranking of the outfit of cars.


15.1 The Problem 


Multidimensional scaling (MDS) is a mathematical tool that uses proximities between
objects, subjects or stimuli to produce a spatial representation of these items. The proximities
are defined as any set of numbers that express the amount of similarity or dissimilarity
between pairs of objects, subjects or stimuli. In contrast to the techniques considered so far,
MDS does not start from the raw multivariate data matrix $\mathcal{X}$, but from a $(n \times n)$ dissimilarity
or distance matrix, $\mathcal{D}$, with the elements $\delta_{ij}$ and $d_{ij}$ respectively. Hence, the underlying
dimensionality of the data under investigation is in general not known.

Figure 15.1. Metric MDS solution for the inter-city road distances.
MVAMDScity1.xpl


MDS is a data reduction technique because it is concerned with the problem of finding a set
of points in low dimension that represents the "configuration" of data in high dimension.
The "configuration" in high dimension is represented by the distance or dissimilarity matrix
$\mathcal{D}$.


MDS techniques are often used to understand how people perceive and evaluate certain
signals and information. For instance, political scientists use MDS techniques to understand
why political candidates are perceived by voters as being similar or dissimilar. Psychologists
use MDS to understand the perceptions and evaluations of speech, colors and personality
traits, among other things. Last but not least, in marketing, researchers use MDS techniques
to shed light on the way consumers evaluate brands and to assess the relationship between
product attributes.


In short, the primary purpose of all MDS techniques is to uncover structural relations or
patterns in the data and to represent them in a simple geometrical model or picture. One
of the aims is to determine the dimension of the model (the goal is a low-dimensional,
easily interpretable model) by finding the $d$-dimensional space in which there is maximum
correspondence between the observed proximities and the distances between points measured
on a metric scale.


Figure 15.2. Metric MDS solution for the inter-city road distances after
reflection and 90° rotation (map of German cities showing Rostock, Hamburg, Berlin,
Dresden, Koblenz and Muenchen; axes: east-west direction in km, north-south direction
in km). MVAMDScity2.xpl


Multidimensional scaling based on proximities is usually referred to as metric MDS, whereas 
the more popular nonmetric MDS is used when the proximities are measured on an ordinal 
scale. 


EXAMPLE 15.1 A good example of how MDS works is given by Dillon and Goldstein (1984)
(Page 108). Suppose one is confronted with a map of Germany and asked to measure, with
the use of a ruler and the scale of the map, some inter-city distances. Admittedly this is
quite an easy exercise. However, let us now reverse the problem: One is given a set of
distances, as in Table 15.1, and is asked to recreate the map itself. This is a far more
difficult exercise, though it can be solved with a ruler and a compass in two dimensions.
MDS is a method for solving this reverse problem in arbitrary dimensions. In Figure 15.2
you can see the graphical representation of the metric MDS solution to Table 15.1 after
rotating and reflecting the points representing the cities. Note that the distances given in
Table 15.1 are road distances that in general do not correspond to Euclidean distances. In
real-life applications, the problems are considerably more complex: there are usually errors in
the data and the dimensionality is rarely known in advance.


            Berlin  Dresden  Hamburg  Koblenz  Munich  Rostock
Berlin           0      214      279      610     596      237
Dresden                   0      492      533     496      444
Hamburg                            0      520     772      140
Koblenz                                     0     521      687
Munich                                              0      771
Rostock                                                      0

Table 15.1. Inter-city distances.


             Audi 100   BMW 5   Citroen AX   Ferrari
Audi 100        0       2.232      3.451      3.689
BMW 5         2.232       0        5.513      3.167
Citroen AX    3.451     5.513        0        6.202
Ferrari       3.689     3.167      6.202        0

Table 15.2. Dissimilarities for cars.


EXAMPLE 15.2 A further example is given in Table 15.2 where consumers noted their 
impressions of the dissimilarity of certain cars. The dissimilarities in this table were in 
fact computed from Table B.7 as Euclidean distances 


$$d_{ij} = \sqrt{\sum_{l=1}^{8} (x_{il} - x_{jl})^2}.$$


MDS produces Figure 15.3 which shows a nonlinear relationship for all the cars in the pro- 
jection. This enables us to build a nonlinear (quadratic) index with the Wartburg and the 
Trabant on the left and the Ferrari and the Jaguar on the right. We can construct an order 
or ranking of the cars based on the subjective impression of the consumers. 


What does the ranking describe? The answer is given by Figure 15.4 which shows the cor- 
relation between the MDS projection and the variables. Apparently, the first MDS direction 
is highly correlated with service(-), value(-), design(-), sportiness(-), safety(-) and price(+). 

Figure 15.3. MDS solution on the car data. MVAmdscarm.xpl


We can interpret the first direction as the price direction since a bad mark in price (“high 
price”) obviously corresponds with a good mark, say, in sportiness (“very sportive”). The 
second MDS direction is highly positively correlated with practicability. We observe from this 
data an almost orthogonal relationship between price and practicability. 


In MDS a map is constructed in Euclidean space that corresponds to given distances. Which
solution can we expect? The solution is determined only up to rotation, reflection and shifts.
In general, if $P_1, \ldots, P_n$ with coordinates $x_i = (x_{i1}, \ldots, x_{ip})^\top$ for $i = 1, \ldots, n$ represents an MDS
solution in $p$ dimensions, then $y_i = \mathcal{A}x_i + b$ with an orthogonal matrix $\mathcal{A}$ and a shift vector
$b$ also represents an MDS solution. A comparison of Figure 15.1 and Figure 15.2 illustrates
this fact.
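The indeterminacy can be seen directly: applying an orthogonal matrix and a shift to a configuration leaves all inter-point distances unchanged. The sketch below uses made-up random points, a 90° rotation combined with a reflection, and an arbitrary shift vector, all of which are assumptions for illustration.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 2))                    # some configuration of 6 points

theta = np.pi / 2                              # 90 degree rotation ...
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
A = rotation @ np.diag([1.0, -1.0])            # ... combined with a reflection
b = np.array([3.0, -1.0])                      # shift vector

Y = X @ A.T + b                                # y_i = A x_i + b, row by row

print(np.allclose(pairwise_distances(X), pairwise_distances(Y)))
```

Any other orthogonal matrix and shift would do equally well, which is why an MDS plot may be rotated or reflected freely to ease interpretation.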


Solution methods that use only the rank order of the distances are termed nonmetric methods
of MDS. Methods aimed at finding the points $P_i$ directly from a distance matrix like the one
in Table 15.2 are called metric methods.

Figure 15.4. Correlation between the MDS direction and the variables
(panel title: Correlations MDS/Variables). MVAmdscarm.xpl


Summary


↪ MDS is a set of techniques which use distances or dissimilarities to project
high-dimensional data into a low-dimensional space essential in
understanding respondents' perceptions and evaluations for all sorts of items.


↪ MDS starts with a $(n \times n)$ proximity matrix $\mathcal{D}$ consisting of dissimilarities
$\delta_{ij}$ or distances $d_{ij}$.


↪ MDS is an explorative technique and focuses on data reduction.


↪ The MDS-solution is indeterminate with respect to rotation, reflection
and shifts.


↪ The MDS-techniques are divided into metric MDS and nonmetric MDS.


15.2 Metric Multidimensional Scaling


Metric MDS begins with a $(n \times n)$ distance matrix $\mathcal{D}$ with elements $d_{ij}$ where $i,j = 1, \ldots, n$.
The objective of metric MDS is to find a configuration of points in $p$-dimensional space
from the distances between the points such that the coordinates of the $n$ points along the $p$
dimensions yield a Euclidean distance matrix whose elements are as close as possible to the
elements of the given distance matrix $\mathcal{D}$.


15.2.1 The Classical Solution 


The classical solution is based on a distance matrix that is computed from a Euclidean 
geometry. 


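DEFINITION 15.1 A $(n \times n)$ distance matrix $\mathcal{D} = (d_{ij})$ is Euclidean if for some points
$x_1, \ldots, x_n \in \mathbb{R}^p$; $d_{ij}^2 = (x_i - x_j)^\top (x_i - x_j)$.


The following result tells us whether a distance matrix is Euclidean or not.


THEOREM 15.1 Define $\mathcal{A} = (a_{ij})$, $a_{ij} = -\frac{1}{2} d_{ij}^2$, and $\mathcal{B} = \mathcal{H}\mathcal{A}\mathcal{H}$ where $\mathcal{H}$ is the centering
matrix. $\mathcal{D}$ is Euclidean if and only if $\mathcal{B}$ is positive semidefinite. If $\mathcal{D}$ is the distance matrix
of a data matrix $\mathcal{X}$, then $\mathcal{B} = \mathcal{H}\mathcal{X}\mathcal{X}^\top\mathcal{H}$. $\mathcal{B}$ is called the inner product matrix.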
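Theorem 15.1 gives a practical check for whether a distance matrix is Euclidean. A minimal sketch, assuming hypothetical function names and a small numerical tolerance for rounding errors in the eigenvalues:

```python
import numpy as np

def inner_product_matrix(D):
    """B = H A H with a_ij = -d_ij^2 / 2 and H the centering matrix."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ (-0.5 * D ** 2) @ H

def is_euclidean(D, tol=1e-9):
    """True if B is positive semidefinite up to a relative tolerance."""
    eigval = np.linalg.eigvalsh(inner_product_matrix(D))
    return bool(eigval.min() >= -tol * max(1.0, np.abs(eigval).max()))

# Distances between the points (0,0), (3,0), (0,4): a Euclidean distance matrix
D_points = np.array([[0.0, 3.0, 4.0],
                     [3.0, 0.0, 5.0],
                     [4.0, 5.0, 0.0]])

# These "distances" violate the triangle inequality (1 + 1 < 3)
D_bad = np.array([[0.0, 1.0, 3.0],
                  [1.0, 0.0, 1.0],
                  [3.0, 1.0, 0.0]])

print(is_euclidean(D_points), is_euclidean(D_bad))
```

For the second matrix the diagonal element $b_{22}$ is already negative, so $\mathcal{B}$ cannot be positive semidefinite.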


Recovery of coordinates 


The task of MDS is to find the original Euclidean coordinates from a given distance matrix.
Let the coordinates of $n$ points in a $p$ dimensional Euclidean space be given by $x_i$ ($i =
1, \ldots, n$) where $x_i = (x_{i1}, \ldots, x_{ip})^\top$. Call $\mathcal{X} = (x_1, \ldots, x_n)^\top$ the coordinate matrix and
assume $\bar x = 0$. The Euclidean distance between the $i$-th and $j$-th points is given by:

$$d_{ij}^2 = \sum_{k=1}^{p} (x_{ik} - x_{jk})^2. \qquad (15.1)$$

The general $b_{ij}$ term of $\mathcal{B}$ is given by:

$$b_{ij} = \sum_{k=1}^{p} x_{ik} x_{jk} = x_i^\top x_j. \qquad (15.2)$$

It is possible to derive $\mathcal{B}$ from the known squared distances $d_{ij}^2$, and then from $\mathcal{B}$ the unknown
coordinates.

$$d_{ij}^2 = x_i^\top x_i + x_j^\top x_j - 2 x_i^\top x_j = b_{ii} + b_{jj} - 2 b_{ij}. \qquad (15.3)$$
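Centering of the coordinate matrix $\mathcal{X}$ implies that $\sum_{i=1}^{n} b_{ij} = 0$. Summing (15.3) over $i$,
over $j$, and over $i$ and $j$, we find:

$$\begin{aligned}
\frac{1}{n} \sum_{i=1}^{n} d_{ij}^2 &= \frac{1}{n} \sum_{i=1}^{n} b_{ii} + b_{jj} \\
\frac{1}{n} \sum_{j=1}^{n} d_{ij}^2 &= b_{ii} + \frac{1}{n} \sum_{j=1}^{n} b_{jj} \\
\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}^2 &= \frac{2}{n} \sum_{i=1}^{n} b_{ii}.
\end{aligned} \qquad (15.4)$$

Solving (15.3) and (15.4) gives:

$$b_{ij} = -\frac{1}{2} \left( d_{ij}^2 - d_{i\bullet}^2 - d_{\bullet j}^2 + d_{\bullet\bullet}^2 \right), \qquad (15.5)$$

where $d_{i\bullet}^2$, $d_{\bullet j}^2$ and $d_{\bullet\bullet}^2$ denote the averages of the squared distances over $j$, over $i$, and
over both indices, i.e., the left-hand sides of (15.4).

With $a_{ij} = -\frac{1}{2} d_{ij}^2$, and

$$\begin{aligned}
a_{i\bullet} &= \frac{1}{n} \sum_{j=1}^{n} a_{ij} \\
a_{\bullet j} &= \frac{1}{n} \sum_{i=1}^{n} a_{ij} \\
a_{\bullet\bullet} &= \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}
\end{aligned} \qquad (15.6)$$

we get:

$$b_{ij} = a_{ij} - a_{i\bullet} - a_{\bullet j} + a_{\bullet\bullet}. \qquad (15.7)$$

Define the matrix $\mathcal{A}$ as $(a_{ij})$, and observe that:

$$\mathcal{B} = \mathcal{H}\mathcal{A}\mathcal{H}. \qquad (15.8)$$

The inner product matrix $\mathcal{B}$ can be expressed as:

$$\mathcal{B} = \mathcal{X}\mathcal{X}^\top, \qquad (15.9)$$

where $\mathcal{X} = (x_1, \ldots, x_n)^\top$ is the $(n \times p)$ matrix of coordinates. The rank of $\mathcal{B}$ is then

$$\mathrm{rank}(\mathcal{B}) = \mathrm{rank}(\mathcal{X}\mathcal{X}^\top) = \mathrm{rank}(\mathcal{X}) = p. \qquad (15.10)$$

As required in Theorem 15.1 the matrix $\mathcal{B}$ is symmetric, positive semidefinite and of rank
$p$, and hence it has $p$ non-negative eigenvalues and $n - p$ zero eigenvalues. $\mathcal{B}$ can now be
written as:

$$\mathcal{B} = \Gamma \Lambda \Gamma^\top \qquad (15.11)$$

where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$, the diagonal matrix of the eigenvalues of $\mathcal{B}$, and $\Gamma = (\gamma_1, \ldots, \gamma_p)$,
the matrix of corresponding eigenvectors. Hence the coordinate matrix $\mathcal{X}$ containing the
point configuration in $\mathbb{R}^p$ is given by:

$$\mathcal{X} = \Gamma \Lambda^{1/2}. \qquad (15.12)$$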


How many dimensions? 


The number of desired dimensions is small in order to provide practical interpretations,
and is given by the rank of $\mathcal{B}$ or the number of nonzero eigenvalues $\lambda_i$. If $\mathcal{B}$ is positive
semidefinite, then the number of nonzero eigenvalues gives the number of eigenvalues required
for representing the distances $d_{ij}$.


The proportion of variation explained by $p$ dimensions is given by

$$\frac{\sum_{i=1}^{p} \lambda_i}{\sum_{i=1}^{n-1} \lambda_i}. \qquad (15.13)$$

It can be used for the choice of $p$. If $\mathcal{B}$ is not positive semidefinite we can modify (15.13) to

$$\frac{\sum_{i=1}^{p} \lambda_i}{\sum (\text{"positive eigenvalues"})}. \qquad (15.14)$$
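The two proportions can be computed with one small helper, since for a positive semidefinite $\mathcal{B}$ the sum of the positive eigenvalues equals the sum of all eigenvalues. The function name and the example eigenvalues below are made up for illustration.

```python
import numpy as np

def explained_proportion(eigval, p):
    """Sum of the p largest eigenvalues divided by the sum of the positive ones.

    This equals (15.13) when all eigenvalues are non-negative and (15.14) otherwise.
    """
    lam = np.sort(np.asarray(eigval, dtype=float))[::-1]
    return lam[:p].sum() / lam[lam > 0].sum()

# Hypothetical eigenvalues of an inner product matrix B
print(explained_proportion([6.0, 3.0, 1.0, 0.0], p=2))    # positive semidefinite case
print(explained_proportion([6.0, 3.0, 1.0, -0.5], p=2))   # one negative eigenvalue
```

In both calls the first two dimensions account for 9 of the 10 units of positive variation.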


In practice the eigenvalues $\lambda_i$ are almost always unequal to zero. To be able to represent the
objects in a space with dimensions as small as possible we may modify the distance matrix
to:

$$\mathcal{D}^* = (d_{ij}^*) \qquad (15.15)$$

with

$$d_{ij}^* = \begin{cases} 0 & ; \; i = j \\ (d_{ij} + e) \ge 0 & ; \; i \ne j \end{cases} \qquad (15.16)$$

where $e$ is determined such that the inner product matrix $\mathcal{B}$ becomes positive semidefinite
with a small rank.


Similarities 


In some situations we do not start with distances but with similarities. The standard
transformation (see Chapter 11) from a similarity matrix $\mathcal{C}$ to a distance matrix $\mathcal{D}$ is:

$$d_{ij} = (c_{ii} - 2c_{ij} + c_{jj})^{1/2}. \qquad (15.17)$$


THEOREM 15.2 If $\mathcal{C} \ge 0$, then the distance matrix $\mathcal{D}$ defined by (15.17) is Euclidean with
centered inner product matrix $\mathcal{B} = \mathcal{H}\mathcal{C}\mathcal{H}$.
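The transformation (15.17) and the statement of Theorem 15.2 can be illustrated with a positive semidefinite similarity matrix of the form $\mathcal{C} = \mathcal{X}\mathcal{X}^\top$. The function name and the four example points are assumptions made for this sketch.

```python
import numpy as np

def similarity_to_distance(C):
    """Distances d_ij = (c_ii - 2 c_ij + c_jj)^(1/2) as in (15.17)."""
    C = np.asarray(C, dtype=float)
    diag = np.diag(C)
    d2 = diag[:, None] - 2.0 * C + diag[None, :]
    return np.sqrt(np.maximum(d2, 0.0))   # guard against tiny negative rounding errors

# A positive semidefinite similarity matrix built from made-up points
X = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [1.0, 1.0]])
C = X @ X.T
D = similarity_to_distance(C)

# Centered inner product matrix computed from D and from C
n = C.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
B_from_D = H @ (-0.5 * D ** 2) @ H
B_from_C = H @ C @ H

print(D[1, 2])                              # distance between (3,0) and (0,4)
print(np.allclose(B_from_D, B_from_C))
```

With this choice of $\mathcal{C}$ the resulting distances are simply the Euclidean distances between the rows of $\mathcal{X}$.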
Relation to Factorial Analysis


Suppose that the $(n \times p)$ data matrix $\mathcal{X}$ is centered so that $\mathcal{X}^\top\mathcal{X}$ equals a multiple of the
covariance matrix, $n\mathcal{S}$. Suppose that the $p$ eigenvalues $\lambda_1, \ldots, \lambda_p$ of $n\mathcal{S}$ are distinct and
nonzero. Using the duality Theorem 8.4 of factorial analysis we see that $\lambda_1, \ldots, \lambda_p$ are also
eigenvalues of $\mathcal{X}\mathcal{X}^\top = \mathcal{B}$ when $\mathcal{D}$ is the Euclidean distance matrix between the rows of $\mathcal{X}$.
The $k$-dimensional solution to the metric MDS problem is thus given by the $k$ first principal
components of $\mathcal{X}$.


Optimality properties of the classical MDS solution


Let $\mathcal{X}$ be a $(n \times p)$ data matrix with some inter-point distance matrix $\mathcal{D}$. The objective
of MDS is thus to find $\mathcal{X}_1$, a representation of $\mathcal{X}$ in a lower dimensional Euclidean space
$\mathbb{R}^k$ whose inter-point distance matrix $\mathcal{D}_1$ is not far from $\mathcal{D}$. Let $\mathcal{L} = (\mathcal{L}_1, \mathcal{L}_2)$ be a $(p \times p)$
orthogonal matrix where $\mathcal{L}_1$ is $(p \times k)$. $\mathcal{X}_1 = \mathcal{X}\mathcal{L}_1$ represents a projection of $\mathcal{X}$ on the
column space of $\mathcal{L}_1$; in other words, $\mathcal{X}_1$ may be viewed as a fitted configuration of $\mathcal{X}$ in $\mathbb{R}^k$.
A measure of discrepancy between $\mathcal{D}$ and $\mathcal{D}_1 = (d_{ij}^{(1)})$ is given by

$$\phi = \sum_{i,j=1}^{n} (d_{ij} - d_{ij}^{(1)})^2. \qquad (15.18)$$


THEOREM 15.3 Among all projections $\mathcal{X}\mathcal{L}_1$ of $\mathcal{X}$ onto $k$-dimensional subspaces of $\mathbb{R}^p$ the
quantity $\phi$ in (15.18) is minimized when $\mathcal{X}$ is projected onto its first $k$ principal factors.


We see therefore that the metric MDS is identical to principal factor analysis as we have
defined it in Chapter 8.
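The identity between the classical MDS solution and the principal components can be observed numerically. The following sketch uses simulated, centered data (an assumption for illustration) and compares the first $k$ principal component scores with the $k$-dimensional MDS coordinates obtained from the Euclidean distances between the rows. Eigenvectors are determined only up to sign, so the comparison is made on absolute values.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 8, 3, 2
X = rng.normal(size=(n, p)) * np.array([3.0, 2.0, 1.0])
X = X - X.mean(axis=0)                        # centered data matrix

# Principal components: eigenvectors of X^T X = nS
eigval_S, eigvec_S = np.linalg.eigh(X.T @ X)
top_S = np.argsort(eigval_S)[::-1][:k]
pc_scores = X @ eigvec_S[:, top_S]

# Classical MDS based on the Euclidean distances between the rows of X
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # squared distances
H = np.eye(n) - np.ones((n, n)) / n
B = H @ (-0.5 * D2) @ H                       # inner product matrix, equals X X^T here
eigval_B, eigvec_B = np.linalg.eigh(B)
top_B = np.argsort(eigval_B)[::-1][:k]
mds_coords = eigvec_B[:, top_B] * np.sqrt(eigval_B[top_B])

print(np.allclose(eigval_S[top_S], eigval_B[top_B]))
print(np.allclose(np.abs(pc_scores), np.abs(mds_coords)))
```

The leading eigenvalues of $\mathcal{X}^\top\mathcal{X}$ and of $\mathcal{B}$ coincide, in line with the duality argument above.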


Summary


↪ Metric MDS starts with a distance matrix $D$.


↪ The aim of metric MDS is to construct a map in Euclidean space that
corresponds to the given distances.


↪ A practical algorithm is given as:
1. start with distances $d_{ij}$,
2. define $A = -\frac{1}{2} d_{ij}^2$,
3. put $B = (a_{ij} - a_{i\bullet} - a_{\bullet j} + a_{\bullet\bullet})$,
4. find the eigenvalues $\lambda_1, \ldots, \lambda_p$ and the associated eigenvectors
$\gamma_1, \ldots, \gamma_p$, where the eigenvectors are normalized so that $\gamma_i^{\top}\gamma_i = 1$,
5. choose an appropriate number of dimensions $p$ (ideally $p = 2$),
6. the coordinates of the $n$ points in the Euclidean space are given by
$x_{ij} = \gamma_{ij}\lambda_j^{1/2}$ for $i = 1, \ldots, n$ and $j = 1, \ldots, p$.


↪ Metric MDS is identical to principal components analysis.
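
The six steps of the practical algorithm can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the book's XploRe code: the helper names `double_center`, `leading_eigenpairs` and `classical_mds` are hypothetical, the sample distances come from a made-up 4 by 3 rectangle, and the eigenvectors are found by simple power iteration with deflation, which presumes that the leading eigenvalues of $B$ are positive and distinct.

```python
import math


def double_center(D):
    """Steps 2-3: a_ij = -1/2 d_ij^2, then b_ij = a_ij - a_i. - a_.j + a_.."""
    n = len(D)
    A = [[-0.5 * D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(A[i]) / n for i in range(n)]
    col = [sum(A[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[A[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]


def leading_eigenpairs(B, k, iters=500):
    """Step 4: k leading eigenpairs via power iteration with deflation."""
    n = len(B)
    B = [r[:] for r in B]  # work on a copy
    pairs = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(n)]  # arbitrary start vector
        for _step in range(iters):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]  # unit length: gamma' gamma = 1
        lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n))
                  for i in range(n))
        pairs.append((lam, v))
        # Deflation: remove the component that was just found.
        for i in range(n):
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j]
    return pairs


def classical_mds(D, k=2):
    """Steps 5-6: coordinates x_ij = gamma_ij * sqrt(lambda_j)."""
    pairs = leading_eigenpairs(double_center(D), k)
    return [[v[i] * math.sqrt(max(lam, 0.0)) for lam, v in pairs]
            for i in range(len(D))]


# Hypothetical example: corners of a 4 x 3 rectangle.
D = [[0, 4, 5, 3],
     [4, 0, 3, 5],
     [5, 3, 0, 4],
     [3, 5, 4, 0]]
coords = classical_mds(D, k=2)
for point in coords:
    print(point)
```

The recovered configuration reproduces the input distances, but only up to rotation, reflection and translation, since distances carry no information about orientation or origin.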


15.3 Nonmetric Multidimensional Scaling


The object of nonmetric MDS, as well as of metric MDS, is to find the coordinates of 
the points in p-dimensional space, so that there is a good agreement between the observed 
proximities and the inter-point distances. The development of nonmetric MDS was motivated 
by two main weaknesses in the metric MDS (Fahrmeir and Hamerle, 1984, Page 679): 


1. the definition of an explicit functional connection between dissimilarities and distances 
in order to derive distances out of given dissimilarities, and 


2. the restriction to Euclidean geometry in order to determine the object configurations. 


The idea of a nonmetric MDS is to demand a less rigid relationship between the dissimilarities
and the distances. Suppose that an unknown monotonic increasing function $f$,

$$d_{ij} = f(\delta_{ij}), \qquad (15.19)$$

is used to generate a set of distances $d_{ij}$ as a function of given dissimilarities $\delta_{ij}$. Here $f$ has
the property that if $\delta_{ij} < \delta_{rs}$, then $f(\delta_{ij}) < f(\delta_{rs})$. The scaling is based on the rank order
of the dissimilarities. Nonmetric MDS is therefore ordinal in character.


The most common approach used to determine the elements $d_{ij}$ and to obtain the coordinates
of the objects $x_1, x_2, \ldots, x_n$ given only rank order information is an iterative process
commonly referred to as the Shepard-Kruskal algorithm.

Figure 15.5. Ranks and distances (plot title: Monotonic Regression; vertical axis: distance). MVAMDSnonmstart.xpl


15.3.1 Shepard-Kruskal algorithm


In a first step, called the initial phase, we calculate Euclidean distances $d_{ij}^{(0)}$ from an arbitrarily
chosen initial configuration $X_0$ in dimension $p^*$, provided that all objects have different
coordinates. One might use metric MDS to obtain these initial coordinates. The second
step or nonmetric phase determines disparities $\hat{d}_{ij}^{(0)}$ from the distances $d_{ij}^{(0)}$ by constructing
a monotone regression relationship between the $d_{ij}^{(0)}$'s and $\delta_{ij}$'s, under the requirement that
if $\delta_{ij} < \delta_{rs}$, then $\hat{d}_{ij}^{(0)} \le \hat{d}_{rs}^{(0)}$. This is called the weak monotonicity requirement. To obtain
the disparities $\hat{d}_{ij}^{(0)}$, a useful approximation method is the pool-adjacent violators (PAV)
algorithm (see Figure 15.6). Let

$$(i_1, j_1) > (i_2, j_2) > \ldots > (i_k, j_k) \qquad (15.20)$$

be the rank order of dissimilarities of the $k = n(n-1)/2$ pairs of objects. This corresponds
to the points in Figure 15.5. The PAV algorithm is described as follows: "beginning with the
lowest ranked value of $\delta_{ij}$, the adjacent $d_{ij}^{(0)}$ values are compared for each $\delta_{ij}$ to determine if
they are monotonically related to the $\delta_{ij}$'s. Whenever a block of consecutive values of $d_{ij}^{(0)}$
are encountered that violate the required monotonicity property the $d_{ij}^{(0)}$ values are averaged
together with the most recent non-violator $d_{ij}^{(0)}$ value to obtain an estimator. Eventually this
value is assigned to all points in the particular block".


Figure 15.6. Pool-adjacent violators algorithm (plot title: Pool-Adjacent-Violator-Algorithm; axes: rank, distance). MVAMDSpooladj.xpl


In a third step, called the metric phase, the spatial configuration of $X_0$ is altered to obtain
$X_1$. From $X_1$ the new distances $d_{ij}^{(1)}$ can be obtained which are more closely related to the
disparities $\hat{d}_{ij}^{(0)}$ from step two.


EXAMPLE 15.3 Consider a small example with 4 objects based on the car marks data set, see Table 15.5.
Our aim is to find a representation with $p^* = 2$ via MDS. Suppose that we choose as an
initial configuration of $X_0$ the coordinates given in Table 15.6. The corresponding distances
$d_{ij} = \sqrt{(x_i - x_j)^{\top}(x_i - x_j)}$ are calculated in Table 15.7.


      j          1          2        3         4
i                Mercedes   Jaguar   Ferrari   VW
1     Mercedes   -
2     Jaguar     3          -
3     Ferrari    2          1        -
4     VW         5          4        6         -

Table 15.5. Dissimilarities $\delta_{ij}$ for car marks.


i                x_{i1}   x_{i2}
1     Mercedes   3        2
2     Jaguar     2        7
3     Ferrari    1        3
4     VW         10       4

Table 15.6. Initial coordinates for MDS.


i,j    d_{ij}   rank(d_{ij})   \delta_{ij}
1,2    5.1      3              3
1,3    2.2      1              2
1,4    7.3      4              5
2,3    4.1      2              1
2,4    8.5      5              4
3,4    9.1      6              6

Table 15.7. Ranks and distances.


A plot of the dissimilarities of Table 15.7 against the distance yields Figure 15.8. This
relation is not satisfactory since the ranking of the $\delta_{ij}$ did not result in a monotone relation
of the corresponding distances $d_{ij}$. We apply therefore the PAV algorithm.


The first violator of monotonicity is the second point (1,3). Therefore we average the distances
$d_{13}$ and $d_{23}$ to obtain the disparities

$$\hat{d}_{13} = \hat{d}_{23} = \frac{d_{13} + d_{23}}{2} = \frac{2.2 + 4.1}{2} = 3.15.$$

Applying the same procedure to (2,4) and (1,4) we obtain $\hat{d}_{24} = \hat{d}_{14} = 7.9$. The plot of $\delta_{ij}$
versus the disparities $\hat{d}_{ij}$ represents a monotone regression relationship.
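

The pooling in this example can be reproduced with a few lines of Python. The sketch below uses the standard block-merging formulation of PAV, which may differ in wording from the quoted description but yields the same weakly increasing fit; the function name `pool_adjacent_violators` is a hypothetical choice, and the distances are the rounded values of the car marks example.

```python
def pool_adjacent_violators(values):
    """Weakly increasing least squares fit to values given in rank order."""
    blocks = []  # each block stores [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)  # every point in a block gets the block mean
    return fitted


# Pairs ordered by increasing dissimilarity delta_ij (ranks 1 to 6),
# with the corresponding rounded distances d_ij of the car marks example.
pairs = [(2, 3), (1, 3), (1, 2), (2, 4), (1, 4), (3, 4)]
distances = [4.1, 2.2, 5.1, 8.5, 7.3, 9.1]

disparities = pool_adjacent_violators(distances)
for pair, d, d_hat in zip(pairs, distances, disparities):
    print(pair, d, round(d_hat, 2))
```

The two pooled blocks are exactly the pairs {(2,3), (1,3)} and {(2,4), (1,4)} identified in the example.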


In the initial configuration (Figure 15.7), the third point (Ferrari) could be moved so that
the distance to object 2 (Jaguar) is reduced. This procedure however also alters the distance
between objects 3 and 4. Care should be given when establishing a monotone relation between
$\delta_{ij}$ and $d_{ij}$.


Figure 15.7. Initial configuration of the MDS of the car data (plot title: Initial Configuration). MVAnmdscar1.xpl


In order to assess how well the derived configuration fits the given dissimilarities Kruskal
suggests a measure called STRESS1 that is given by

$$\mathrm{STRESS1} = \left( \frac{\sum_{i<j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i<j} d_{ij}^2} \right)^{1/2}. \qquad (15.21)$$

An alternative stress measure is given by

$$\mathrm{STRESS2} = \left( \frac{\sum_{i<j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i<j} (d_{ij} - \bar{d})^2} \right)^{1/2}, \qquad (15.22)$$

where $\bar{d}$ denotes the average distance.

EXAMPLE 15.4 Table 15.8 presents the STRESS calculations for the car example.
The average distance is $\bar{d} = 36.3/6 \approx 6.1$. The corresponding STRESS measures are:

$$\mathrm{STRESS1} = \sqrt{2.6/256} = 0.1$$
$$\mathrm{STRESS2} = \sqrt{2.6/36.4} = 0.27.$$
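
Both measures are straightforward to compute. The sketch below is illustrative only: the function names `stress1` and `stress2` are hypothetical, and the inputs are the rounded distances and disparities of the car example, so the results agree with the hand calculation only up to rounding.

```python
import math


def stress1(d, d_hat):
    """STRESS1 (15.21): squared misfit relative to the sum of squared distances."""
    num = sum((a - b) ** 2 for a, b in zip(d, d_hat))
    return math.sqrt(num / sum(a * a for a in d))


def stress2(d, d_hat):
    """STRESS2 (15.22): squared misfit relative to the spread around the mean distance."""
    mean_d = sum(d) / len(d)
    num = sum((a - b) ** 2 for a, b in zip(d, d_hat))
    return math.sqrt(num / sum((a - mean_d) ** 2 for a in d))


# Rounded distances and PAV disparities for the pairs
# (2,3), (1,3), (1,2), (2,4), (1,4), (3,4).
d = [4.1, 2.2, 5.1, 8.5, 7.3, 9.1]
d_hat = [3.15, 3.15, 5.1, 7.9, 7.9, 9.1]

print(round(stress1(d, d_hat), 2), round(stress2(d, d_hat), 2))
```

Because the table entries are rounded to one decimal place before being summed, the hand calculation and the code differ slightly in the second decimal of STRESS2.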

(i,j)    \delta_{ij}   d_{ij}   \hat{d}_{ij}   (d_{ij} - \hat{d}_{ij})^2   d_{ij}^2   (d_{ij} - \bar{d})^2
(2,3)    1             4.1      3.15           0.9                         16.8       3.8
(1,3)    2             2.2      3.15           0.9                         4.8        14.8
(1,2)    3             5.1      5.1            0                           26.0       0.9
(2,4)    4             8.5      7.9            0.4                         72.3       6.0
(1,4)    5             7.3      7.9            0.4                         53.3       1.6
(3,4)    6             9.1      9.1            0                           82.8       9.3
Σ                      36.3                    2.6                         256.0      36.4

Table 15.8. STRESS calculations for car marks example.


The goal is to find a point configuration that balances the effects STRESS and non monotonicity.
This is achieved by an iterative procedure. More precisely, one defines a new
position of object $i$ relative to object $j$ by

$$x_{il}^{NEW} = x_{il} + \alpha \left(1 - \frac{\hat{d}_{ij}}{d_{ij}}\right)(x_{jl} - x_{il}), \quad l = 1, \ldots, p^*. \qquad (15.23)$$

Here $\alpha$ denotes the step width of the iteration.


By (15.23) the configuration of object $i$ is improved relative to object $j$. In order to obtain
an overall improvement relative to all remaining points one uses:

$$x_{il}^{NEW} = x_{il} + \frac{\alpha}{n-1} \sum_{j=1, j \ne i}^{n} \left(1 - \frac{\hat{d}_{ij}}{d_{ij}}\right)(x_{jl} - x_{il}), \quad l = 1, \ldots, p^*. \qquad (15.24)$$

The choice of step width $\alpha$ is crucial. Kruskal proposes a starting value of $\alpha = 0.2$. The
iteration is continued by a numerical approximation procedure, such as steepest descent or
the Newton-Raphson procedure.


In a fourth step, the evaluation phase, the STRESS measure is used to evaluate whether 
or not its change as a result of the last iteration is sufficiently small that the procedure is 
terminated. At this stage the optimal fit has been obtained for a given dimension. Hence, 
the whole procedure needs to be carried out for several dimensions. 


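EXAMPLE 15.5 Let us compute the new point configuration for $i = 3$ (Ferrari). The initial
coordinates from Table 15.6 are

$$x_{31} = 1 \quad \text{and} \quad x_{32} = 3.$$

Applying (15.24) yields (for $\alpha = 3$):

$$x_{31}^{NEW} = 1 + \frac{3}{4-1} \sum_{j=1, j \ne 3}^{4} \left(1 - \frac{\hat{d}_{3j}}{d_{3j}}\right)(x_{j1} - 1)$$
$$= 1 + \left(1 - \frac{3.15}{2.2}\right)(3 - 1) + \left(1 - \frac{3.15}{4.1}\right)(2 - 1) + \left(1 - \frac{9.1}{9.1}\right)(10 - 1)$$
$$= 1 - 0.86 + 0.23 + 0$$
$$= 0.37.$$

Similarly we obtain $x_{32}^{NEW} = 4.36$.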
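

The update (15.24) for a single object can be checked numerically. The sketch below is an illustration under assumptions: the function name `new_position` is hypothetical, the distances and disparities are the rounded values of the car example, and Jaguar's second coordinate (7) is inferred from the tabulated distances rather than read directly from the coordinate table.

```python
def new_position(i, coords, d, d_hat, alpha):
    """One application of (15.24) to object i; returns its new coordinates."""
    n = len(coords)
    xi = coords[i]
    shift = [0.0] * len(xi)
    for j, xj in coords.items():
        if j == i:
            continue
        key = (min(i, j), max(i, j))
        weight = 1.0 - d_hat[key] / d[key]  # zero when distance equals disparity
        for l in range(len(xi)):
            shift[l] += weight * (xj[l] - xi[l])
    return [xi[l] + alpha / (n - 1) * shift[l] for l in range(len(xi))]


# Initial configuration: 1 Mercedes, 2 Jaguar, 3 Ferrari, 4 VW.
# Assumption: Jaguar = (2, 7), inferred from the distances below.
coords = {1: (3, 2), 2: (2, 7), 3: (1, 3), 4: (10, 4)}
d = {(1, 2): 5.1, (1, 3): 2.2, (1, 4): 7.3,
     (2, 3): 4.1, (2, 4): 8.5, (3, 4): 9.1}
d_hat = {(1, 2): 5.1, (1, 3): 3.15, (1, 4): 7.9,
         (2, 3): 3.15, (2, 4): 7.9, (3, 4): 9.1}

ferrari_new = new_position(3, coords, d, d_hat, alpha=3.0)
print([round(x, 2) for x in ferrari_new])
```

The VW term contributes nothing because its distance to Ferrari already equals the disparity, which mirrors the zero in the hand calculation.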

Figure 15.9. First iteration for Ferrari (plot title: First iteration for Ferrari; labeled points: Ferrari_NEW, Ferrari_init, Mercedes). MVAnmdscar3.xpl


To find the appropriate number of dimensions, p*, a plot of the minimum STRESS value as 
a function of the dimensionality is made. One possible criterion in selecting the appropriate 
dimensionality is to look for an elbow in the plot. A rule of thumb that can be used to
decide if a STRESS value is sufficiently small or not is provided by Kruskal:


$$S > 20\%, \text{ poor}; \quad S = 10\%, \text{ fair}; \quad S < 5\%, \text{ good}; \quad S = 0, \text{ perfect}. \qquad (15.25)$$


Summary


↪ Nonmetric MDS is only based on the rank order of dissimilarities.


↪ The object of nonmetric MDS is to create a spatial representation of the
objects with low dimensionality.


↪ A practical algorithm is given as:
1. Choose an initial configuration.
2. Find $d_{ij}$ from the configuration.
3. Fit $\hat{d}_{ij}$, the disparities, by the PAV algorithm.
4. Find a new configuration $X_{n+1}$ by using the steepest descent.
5. Go to 2.


15.4 Exercises 


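EXERCISE 15.1 Apply the MDS method to the Swiss bank note data. What do you expect
to see?


EXERCISE 15.2 Using (15.6), show that (15.7) can be written in the form (15.2).


EXERCISE 15.3 Show that

1. $b_{ii} = a_{\bullet\bullet} - 2a_{i\bullet}$; $b_{ij} = a_{ij} - a_{i\bullet} - a_{\bullet j} + a_{\bullet\bullet}$; $i \ne j$

2. $B = \sum_{i=1}^{p} x_i x_i^{\top}$

3. $\sum_{i=1}^{n} \lambda_i = \sum_{i=1}^{n} b_{ii} = -n a_{\bullet\bullet} = \frac{1}{2n} \sum_{i,j=1}^{n} d_{ij}^2$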


EXERCISE 15.4 Redo a careful analysis of the car marks data based on the following dis- 
similarity matrix: 


      j          1        2          3     4
i                Nissan   Wartburg   BMW   Audi
1     Nissan     -
2     Wartburg   2        -
3     BMW        4        6          -
4     Audi       3        5          1     -


EXERCISE 15.5 Apply the MDS method to the U.S. health data. Is the result in accordance
with the geographic location of the U.S. states? 


EXERCISE 15.6 Redo Exercise 15.5 with the U.S. crime data. 


EXERCISE 15.7 Perform the MDS analysis on the Athletic Records data in Appendix B.18. 
Can you see which countries are “close to each other”? 


16 Conjoint Measurement Analysis 


Conjoint Measurement Analysis plays an important role in marketing. In the design of new 
products it is valuable to know which components carry what kind of utility for the customer. 
Marketing and advertisement strategies are based on the perception of the new product’s 
overall utility. It can be valuable information for a car producer to know whether a change in 
sportiness or a change in safety equipment is perceived as a higher increase in overall utility. 
The Conjoint Measurement Analysis is a method for attributing utilities to the components 
(part worths) on the basis of ranks given to different outcomes (stimuli) of the product. An 
important assumption is that the overall utility is decomposed as a sum of the utilities of 
the components. 


In Section 16.1 we introduce the idea of Conjoint Measurement Analysis. We give two 
examples from the food and car industries. In Section 16.2 we shed light on the problem of 
designing questionnaires for ranking different product outcomes. In Section 16.3 we see that 
the metric solution of estimating the part-worths is given by solving a least squares problem. 
The estimated preference ordering may be nonmonotone. The nonmetric solution strategy 
takes care of this inconsistency by iterating between a least squares solution and the pool 
adjacent violators algorithm. 


16.1 Introduction 


In the design and perception of new products it is important to specify the contributions
made by different facets or elements. The overall utility and acceptance of such a new
product can then be estimated and understood as a possibly additive function of the elementary
utilities. Examples are the design of cars, a food article or the program of a political
party. For a new type of margarine one may ask whether a change in taste or presentation
will enhance the overall perception of the product. The elementary utilities are here the
presentation style and the taste (e.g., calorie content). For a party program one may want to
investigate whether a stronger ecological or a stronger social orientation gives a better overall
profile of the party. For the marketing of a new car one may be interested in whether this new
car should have stronger active safety equipment, a sportier note, or a combination
of both.


In Conjoint Measurement Analysis one assumes that the overall utility can be explained as
an additive decomposition of the utilities of different elements. In a sample of questionnaires 
people ranked the product types and thus revealed their preference orderings. The aim is to 
find the decomposition of the overall utility on the basis of observed data and to interpret 
the elementary or marginal utilities. 


EXAMPLE 16.1 A car producer plans to introduce a new car with features that appeal to the 
customer and that may help in promoting future sales. The new elements that are considered 
are safety components (airbag component just for the driver or also for the second front seat) 
and a sporty look (leather steering wheel vs. leather interior). The car producer has thus 4 
lines of cars. 


car 1: basic safety equipment and low sportiness
car 2: basic safety equipment and high sportiness
car 3: high safety equipment and low sportiness
car 4: high safety equipment and high sportiness


For the car producer it is important to rank these cars and to find out customers' attitudes
toward a certain product line in order to develop a suitable marketing scheme. A tester may
rank the cars as follows:


car       1   2   3   4
ranking   1   2   4   3


Table 16.1. Tester's ranking of cars.


The elementary utilities here are the safety equipment and the level of sportiness. Conjoint 
Measurement Analysis aims at explaining the rank order given by the test person as a function 
of these elementary utilities. 


EXAMPLE 16.2 A food producer plans to create a new margarine and varies the product 
characteristics “calories” (low vs. high) and “presentation” (a plastic pot vs. paper package) 
(Backhaus, Erichson, Plinke and Weiber, 1996). We can view this in fact as ranking four 
products. 


product 1: low calories and plastic pot
product 2: low calories and paper package
product 3: high calories and plastic pot
product 4: high calories and paper package


These four fictitious products may now be ordered by a set of sample testers as described in
Table 16.2.


Product   1   2   3   4
ranking   [values illegible in the source]


Table 16.2. Tester's ranking of margarine.


The Conjoint Measurement Analysis aims to explain such a preference ranking by attributing 
part-worths to the different elements of the product. The part-worths are the utilities of the 
elementary components of the product. 


In interpreting the part-worths one may find that for a test person one of the elements has 
a higher value or utility. This may lead to a new design or to the decision that this utility 
should be emphasized in advertisement schemes. 


Summary


↪ Conjoint Measurement Analysis is used in the design of new products.


↪ Conjoint Measurement Analysis tries to identify part-worth utilities that
contribute to an overall utility.


↪ The part-worths enter additively into an overall utility.


↪ The interpretation of the part-worths gives insight into the perception and
acceptance of the product.


16.2 Design of Data Generation 


The product is defined through the properties of the components. A stimulus is defined as 
a combination of the different components. Examples 16.1 and 16.2 had four stimuli each. 
In the margarine example they were the possible combinations of the factors $X_1$ (calories)
and $X_2$ (presentation). If a product property such as

$$X_3\,(\text{usage}) = \begin{cases} 1 & \text{bread} \\ 2 & \text{cooking} \\ 3 & \text{universal} \end{cases}$$

is added, then there are $3 \cdot 2 \cdot 2 = 12$ stimuli.


For the automobile Example 16.1 additional characteristics may be engine power and the
number of doors. Suppose that the engines offered for the new car have 50, 70 or 90 kW and
that the car may be produced in 2-, 4-, or 5-door versions. These categories may be coded
as

$$X_3\,(\text{power of engine}) = \begin{cases} 1 & 50 \text{ kW} \\ 2 & 70 \text{ kW} \\ 3 & 90 \text{ kW} \end{cases}$$

and

$$X_4\,(\text{doors}) = \begin{cases} 1 & 2 \text{ doors} \\ 2 & 4 \text{ doors} \\ 3 & 5 \text{ doors} \end{cases}.$$

Both $X_3$ and $X_4$ have three factor levels each, whereas the first two factors $X_1$ (safety) and
$X_2$ (sportiness) have only two levels. Altogether $2 \cdot 2 \cdot 3 \cdot 3 = 36$ stimuli are possible. In a
questionnaire a tester would have to rank all 36 different products.


The profile method asks for the utility of each stimulus. This may be time consuming and
tiring for a test person if there are too many factors and factor levels. Suppose that there
are 6 properties of components with 3 levels each. This results in $3^6 = 729$ stimuli (i.e., 729
different products) that a tester would have to rank.
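

The growth in the number of stimuli can be made concrete by enumerating the full factorial design. The following is a small sketch; the function name `stimuli` and the coding of levels as 1, 2, 3 are illustrative choices, not part of the original text.

```python
from itertools import product
from math import prod


def stimuli(levels_per_factor):
    """All combinations of factor levels, with levels coded 1..L_j."""
    return list(product(*[range(1, L + 1) for L in levels_per_factor]))


margarine = stimuli([2, 2, 3])   # calories, presentation, usage
car = stimuli([2, 2, 3, 3])      # safety, sportiness, engine power, doors

print(len(margarine), len(car))  # number of profiles a tester must rank
print(prod([3] * 6))             # six properties with three levels each
print(margarine[:3])             # first few stimuli as level tuples
```

Each tuple is one profile that would have to be presented to the test person under the profile method.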


The two factor method is a simplification and considers only two factors simultaneously. It 
is also called trade-off analysis. The idea is to present just two stimuli at a time and then to 
recombine the information. Trade-off analysis is performed by defining the trade-off matrices 
corresponding to stimuli of two factors only. 


The trade-off matrices for the levels of $X_1$, $X_2$ and $X_3$ from the margarine Example 16.2 are
given below.


[Table 16.4. Trade-off matrices for margarine. Matrix layout not recoverable from the source.]


The trade-off matrices for the new car outfit are as follows:


[Table 16.5. Trade-off matrices for car design. Matrix layout not recoverable from the source.]


The choice between the profile method and the trade-off analysis should be guided by con- 
sideration of the following aspects: 


1. requirements on the test person, 
2. time consumption, 


3. product perception. 


The first aspect relates to the ability of the test person to judge the different stimuli. It is 
certainly an advantage of the trade-off analysis that one only has to consider two factors 
simultaneously. The two factor method can be carried out more easily in a questionnaire 
without an interview. 


The profile method incorporates the possibility of a complete product perception since the
test person is not confronted with an isolated aspect (2 factors) of the product. The stimuli
may be presented visually in their final form (e.g., as a picture). With the profile method the
number of stimuli rises exponentially with the number of properties and levels. The time
to complete a questionnaire is therefore a factor in the choice of method.


In general, product perception is the most important aspect, and the profile
method is therefore the one used most often. The time consumption aspect speaks for the trade-off analysis.
There exist, however, clever strategies for selecting representative subsets of all profiles that
bound the time investment. We therefore concentrate on the profile method in the following.


Summary


↪ A stimulus is a combination of different properties of a product.


↪ Conjoint measurement analysis is based either on a list of all factors (profile
method) or on trade-off matrices (two factor method).


↪ Trade-off matrices are used if there are too many factor levels.


↪ Presentation of trade-off matrices makes it easier for testers since only two
stimuli have to be ranked at a time.


16.3 Estimation of Preference Orderings


On the basis of the reported preference values for each stimulus conjoint analysis determines
the part-worths. Conjoint analysis uses an additive model of the form

$$Y_k = \sum_{j=1}^{J} \sum_{l=1}^{L_j} \beta_{jl}\, I(X_j = x_{jl}) + \mu, \quad \text{for } k = 1, \ldots, K \text{ and } \forall j \; \sum_{l=1}^{L_j} \beta_{jl} = 0. \qquad (16.1)$$

$X_j$ ($j = 1, \ldots, J$) denote the factors, $x_{jl}$ ($l = 1, \ldots, L_j$) are the levels of each factor $X_j$ and
the coefficients $\beta_{jl}$ are the part-worths. The constant $\mu$ denotes an overall level, $Y_k$ is
the observed preference for each stimulus, and the total number of stimuli is:

$$K = \prod_{j=1}^{J} L_j.$$

Equation (16.1) is without an error term for the moment. In order to explain how (16.1) may
be written in the standard linear model form we first concentrate on $J = 2$ factors. Suppose
that the factors engine power and airbag safety equipment have been ranked as follows:


                      airbag
                      1    2
         50 kW   1    1    3
engine   70 kW   2    2    6
         90 kW   3    4    5


There are $K = 6$ preferences altogether. Suppose that the stimuli have been sorted so that
$Y_1$ corresponds to engine level 1 and airbag level 1, $Y_2$ corresponds to engine level 1 and
airbag level 2, and so on. Then model (16.1) reads:

$$Y_1 = \beta_{11} + \beta_{21} + \mu$$
$$Y_2 = \beta_{11} + \beta_{22} + \mu$$
$$Y_3 = \beta_{12} + \beta_{21} + \mu$$
$$Y_4 = \beta_{12} + \beta_{22} + \mu$$
$$Y_5 = \beta_{13} + \beta_{21} + \mu$$
$$Y_6 = \beta_{13} + \beta_{22} + \mu.$$

Now we would like to estimate the part-worths $\beta_{jl}$.

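EXAMPLE 16.3 In the margarine example let us consider the part-worths of $X_1$ = usage
and $X_2$ = calories. We have $x_{11} = 1$, $x_{12} = 2$, $x_{13} = 3$, $x_{21} = 1$ and $x_{22} = 2$. (We
momentarily re-labeled the factors: $X_3$ became $X_1$.) Hence $L_1 = 3$ and $L_2 = 2$. Suppose that
a person has ranked the six different products as in Table 16.7.


                              X_2 (calories)
                              low   high
                              1     2
              bread       1   2     1
X_1 (usage)   cooking     2   3     4
              universal   3   6     5


Table 16.7. Ranked products.


If we order the stimuli as follows:

$$Y_1 = \mathrm{Utility}(X_1 = 1 \wedge X_2 = 1)$$
$$Y_2 = \mathrm{Utility}(X_1 = 1 \wedge X_2 = 2)$$
$$Y_3 = \mathrm{Utility}(X_1 = 2 \wedge X_2 = 1)$$
$$Y_4 = \mathrm{Utility}(X_1 = 2 \wedge X_2 = 2)$$
$$Y_5 = \mathrm{Utility}(X_1 = 3 \wedge X_2 = 1)$$
$$Y_6 = \mathrm{Utility}(X_1 = 3 \wedge X_2 = 2),$$

we obtain from equation (16.1) the same decomposition as above:

$$Y_1 = \beta_{11} + \beta_{21} + \mu$$
$$Y_2 = \beta_{11} + \beta_{22} + \mu$$
$$Y_3 = \beta_{12} + \beta_{21} + \mu$$
$$Y_4 = \beta_{12} + \beta_{22} + \mu$$
$$Y_5 = \beta_{13} + \beta_{21} + \mu$$
$$Y_6 = \beta_{13} + \beta_{22} + \mu.$$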

                              X_2 (calories)
                              low    high
                              1      2       \bar{p}_{x_{1l}}   \beta_{1l}
              bread       1   2      1       1.5                -2
X_1 (usage)   cooking     2   3      4       3.5                0
              universal   3   6      5       5.5                2
\bar{p}_{x_{2l}}              3.66   3.33    3.5
\beta_{2l}                    0.16   -0.16


Table 16.9. Metric solution for Table 16.7.


Our aim is to estimate the part-worths $\beta_{jl}$ as well as possible from a collection of tables like
Table 16.7 that have been generated by a sample of test persons. First, the so-called metric
solution to this problem is discussed and then a non-metric solution.


Metric Solution 


The problem of conjoint measurement analysis can be solved by the technique of Analysis
of Variance. An important assumption underlying this technique is that the "distance"
between any two adjacent preference orderings corresponds to the same difference in utility.
That is, the difference in utility between the products ranked 1st and 2nd is the same as the
difference in utility between the products ranked 4th and 5th. Put differently, we treat the
ranking of the products, which is an ordinal variable, as if it were a metric variable.


Introducing a mean utility $\mu$, equation (16.1) can be rewritten. The mean utility in the above
Example 16.3 is $\mu = (1+2+3+4+5+6)/6 = 21/6 = 3.5$. In order to check the deviations
of the utilities from this mean, we enlarge Table 16.7 by the mean utility $\bar{p}_{x_{jl}}$, given a certain
level of the other factor. The metric solution for the car example is given in Table 16.8:


                         X_2 (airbags)
                         1      2       \bar{p}_{x_{1l}}   \beta_{1l}
               50 kW  1  1      3       2                  -1.5
X_1 (engine)   70 kW  2  2      6       4                  0.5
               90 kW  3  4      5       4.5                1
\bar{p}_{x_{2l}}         2.33   4.66    3.5
\beta_{2l}               -1.16  1.16


Table 16.8. Metric solution for car example.


Stimulus   Y_k   \hat{Y}_k   Y_k - \hat{Y}_k   (Y_k - \hat{Y}_k)^2
1          2     1.66        0.33              0.11
2          1     1.33        -0.33             0.11
3          3     3.66        -0.66             0.44
4          4     3.33        0.66              0.44
5          6     5.66        0.33              0.11
6          5     5.33        -0.33             0.11
Σ          21    21          0                 1.33


Table 16.10. Deviations between model and data.


EXAMPLE 16.4 In the margarine example the resulting part-worths for $\mu = 3.5$ are

$$\beta_{11} = -2, \qquad \beta_{21} = 0.16,$$
$$\beta_{12} = 0, \qquad \beta_{22} = -0.16,$$
$$\beta_{13} = 2.$$

Note that $\sum_{l=1}^{L_j} \beta_{jl} = 0$ ($j = 1, \ldots, J$). The estimated utility $\hat{Y}_1$ for the product with low calories
and usage for bread, for example, is:

$$\hat{Y}_1 = \beta_{11} + \beta_{21} + \mu = -2 + 0.16 + 3.5 = 1.66.$$

The estimated utility $\hat{Y}_4$ for product 4 (cooking ($X_1 = 2$) and high calories ($X_2 = 2$)) is:

$$\hat{Y}_4 = \beta_{12} + \beta_{22} + \mu = 0 - 0.16 + 3.5 = 3.33.$$

The coefficients $\beta_{jl}$ are computed as $\bar{p}_{x_{jl}} - \mu$, where $\bar{p}_{x_{jl}}$ is the average preference ordering
for each factor level. For instance, $\bar{p}_{x_{11}} = \frac{1}{2}(2 + 1) = 1.5$.


The fit can be evaluated by calculating the deviations of the fitted values to the observed
preference orderings. In the rightmost column of Table 16.10 the quadratic deviations between
the observed rankings (utilities) $Y_k$ and the estimated utilities $\hat{Y}_k$ are listed.
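

The metric solution amounts to taking row and column means of the ranking table. The sketch below reproduces the part-worths, the fitted utilities and the sum of squared deviations for the margarine rankings; the function name `metric_part_worths` is hypothetical, and the results are exact fractions where the text shows values cut to two decimals.

```python
def metric_part_worths(table):
    """Overall mean and part-worths as deviations of level means from it."""
    rows, cols = len(table), len(table[0])
    mu = sum(sum(r) for r in table) / (rows * cols)
    beta1 = [sum(r) / cols - mu for r in table]            # row factor
    beta2 = [sum(table[i][j] for i in range(rows)) / rows - mu
             for j in range(cols)]                         # column factor
    return mu, beta1, beta2


# Margarine rankings: rows = usage (bread, cooking, universal),
# columns = calories (low, high).
ranks = [[2, 1],
         [3, 4],
         [6, 5]]

mu, beta1, beta2 = metric_part_worths(ranks)
fitted = [[mu + b1 + b2 for b2 in beta2] for b1 in beta1]
sse = sum((ranks[i][j] - fitted[i][j]) ** 2
          for i in range(3) for j in range(2))

print(mu, beta1, [round(b, 2) for b in beta2])
print([[round(v, 2) for v in row] for row in fitted])
print(round(sse, 2))
```

The part-worths of each factor sum to zero by construction, which is the side condition of model (16.1).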


The technique that generated Table 16.9 is in fact the solution to a least squares
problem. The conjoint measurement problem (16.1) may be rewritten as a linear regression
model (with error $\varepsilon = 0$):

$$Y = X\beta + \varepsilon \qquad (16.2)$$

with $X$ being a design matrix with dummy variables. $X$ has the row dimension $K = \prod_{j=1}^{J} L_j$
(the number of stimuli) and the column dimension $D = \sum_{j=1}^{J} L_j - J$. The reason for the
reduced column number is that per factor only $(L_j - 1)$ vectors are linearly independent.
Without loss of generality we may standardize the problem so that the last coefficient of each
factor is omitted. The error term $\varepsilon$ is introduced since even for one person the preference
orderings may not fit the model (16.1).


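EXAMPLE 16.5 If we rewrite the $\beta$ coefficients in the form

$$\begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{pmatrix} = \begin{pmatrix} \mu + \beta_{13} + \beta_{22} \\ \beta_{11} - \beta_{13} \\ \beta_{12} - \beta_{13} \\ \beta_{21} - \beta_{22} \end{pmatrix} \qquad (16.3)$$

and define the design matrix $X$ as

$$X = \begin{pmatrix} 1 & 1 & 0 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix} \qquad (16.4)$$

then equation (16.1) leads to the linear model (with error $\varepsilon = 0$):

$$Y = X\beta + \varepsilon. \qquad (16.5)$$

The least squares solution to this problem is the technique used for Table 16.9.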


In practice we have more than one person to answer the utility rank question for the different
factor levels. The design matrix is then obtained by stacking the above design matrix $n$ times.
Hence, for $n$ persons we have as a final design matrix:

$$X^* = 1_n \otimes X = \begin{pmatrix} X \\ \vdots \\ X \end{pmatrix} \quad (n \text{ times}),$$

which has dimension $(nK) \times (L - J)$ (where $L = \sum_{j=1}^{J} L_j$) and $Y^* = (Y_1^{\top}, \ldots, Y_n^{\top})^{\top}$. The linear
model (16.5) can now be written as:

$$Y^* = X^*\beta + \varepsilon^*. \qquad (16.6)$$

Given that the test people assign different rankings, the error term $\varepsilon^*$ is a necessary part of
the model.
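

EXAMPLE 16.6 If we take the $\beta$ vector as defined in (16.3) and the design matrix $X$
from (16.4), we obtain the coefficients:

$$\hat{\beta}_1 = 5.33 = \hat{\mu} + \hat{\beta}_{13} + \hat{\beta}_{22}$$
$$\hat{\beta}_2 = -4 = \hat{\beta}_{11} - \hat{\beta}_{13}$$
$$\hat{\beta}_3 = -2 = \hat{\beta}_{12} - \hat{\beta}_{13}$$
$$\hat{\beta}_4 = 0.33 = \hat{\beta}_{21} - \hat{\beta}_{22} \qquad (16.7)$$
$$\sum_{l=1}^{L_j} \hat{\beta}_{jl} = 0.$$

Solving (16.7) we have:

$$\hat{\beta}_{11} = \hat{\beta}_2 - \tfrac{1}{3}(\hat{\beta}_2 + \hat{\beta}_3) = -2$$
$$\hat{\beta}_{12} = \hat{\beta}_3 - \tfrac{1}{3}(\hat{\beta}_2 + \hat{\beta}_3) = 0$$
$$\hat{\beta}_{13} = -\tfrac{1}{3}(\hat{\beta}_2 + \hat{\beta}_3) = 2 \qquad (16.8)$$
$$\hat{\beta}_{21} = \hat{\beta}_4 - \tfrac{1}{2}\hat{\beta}_4 = \tfrac{1}{2}\hat{\beta}_4 = 0.16$$
$$\hat{\beta}_{22} = -\tfrac{1}{2}\hat{\beta}_4 = -0.16$$
$$\hat{\mu} = \hat{\beta}_1 + \tfrac{1}{3}(\hat{\beta}_2 + \hat{\beta}_3) + \tfrac{1}{2}\hat{\beta}_4 = 3.5.$$

In fact, we obtain the same estimated part-worths as in Table 16.9. The stimulus $k = 2$
corresponds to adding up $\beta_{11}$, $\beta_{22}$, and $\mu$ (see (16.3)). Adding $\hat{\beta}_1$ and $\hat{\beta}_2$ gives:

$$\hat{Y}_2 = 5.33 - 4 = 1.33.$$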
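

The least squares coefficients can be verified by solving the normal equations $X^{\top}X\beta = X^{\top}Y$ directly. The sketch below is one possible way to do this without external libraries; the helper `solve` (Gaussian elimination with partial pivoting) is a hypothetical utility written for this illustration, and the back-transformation follows (16.8).

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


# Design matrix (16.4) and the margarine rankings in stimulus order.
X = [[1, 1, 0, 1],
     [1, 1, 0, 0],
     [1, 0, 1, 1],
     [1, 0, 1, 0],
     [1, 0, 0, 1],
     [1, 0, 0, 0]]
Y = [2, 1, 3, 4, 6, 5]

XtX = [[sum(X[k][i] * X[k][j] for k in range(6)) for j in range(4)]
       for i in range(4)]
XtY = [sum(X[k][i] * Y[k] for k in range(6)) for i in range(4)]
beta = solve(XtX, XtY)

# Back-transformation to part-worths, following (16.8).
b13 = -(beta[1] + beta[2]) / 3
b11, b12 = beta[1] + b13, beta[2] + b13
b22 = -beta[3] / 2
b21 = beta[3] + b22
mu = beta[0] - b13 - b22

print([round(b, 2) for b in beta])
print(b11, b12, b13, round(b21, 2), round(b22, 2), mu)
```

For larger designs a numerically more robust route, such as a QR decomposition from a linear algebra library, would usually be preferred over forming the normal equations explicitly.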


Nonmetric solution 


If we drop the assumption that utilities are measured on a metric scale, we have to use (16.1)
to estimate the coefficients from an adjusted set of estimated utilities. More precisely, we
may use the monotone ANOVA as developed by Kruskal (1965). The procedure works as
follows. First, one estimates model (16.1) with the ANOVA technique described above. Then
one applies a monotone transformation $\hat{Z} = f(\hat{Y})$ to the estimated stimulus utilities. The
monotone transformation $f$ is used because the fitted values $\hat{Y}_k$ from (16.2) of the reported
preference orderings $Y_k$ may not be monotone. The transformation $\hat{Z}_k = f(\hat{Y}_k)$ is introduced
to guarantee monotonicity of preference orderings. For the car example the reported $Y_k$
values were $Y = (1, 3, 2, 6, 4, 5)^{\top}$. With the part-worths $\hat{\beta}_{11} = -1.5$, $\hat{\beta}_{12} = 0.5$, $\hat{\beta}_{13} = 1$,
$\hat{\beta}_{21} = -1.16$ and $\hat{\beta}_{22} = 1.16$ obtained from the row and column means of the ranking table
(row means 2, 4 and 4.5; overall mean 3.5), the estimated values are computed as:

$$\hat{Y}_1 = -1.5 - 1.16 + 3.5 = 0.84$$
$$\hat{Y}_2 = -1.5 + 1.16 + 3.5 = 3.16$$
$$\hat{Y}_3 = 0.5 - 1.16 + 3.5 = 2.84$$
$$\hat{Y}_4 = 0.5 + 1.16 + 3.5 = 5.16$$
$$\hat{Y}_5 = 1 - 1.16 + 3.5 = 3.34$$
$$\hat{Y}_6 = 1 + 1.16 + 3.5 = 5.66.$$

If we make a plot of the estimated preference orderings against the revealed ones, we obtain
Figure 16.1.


Figure 16.1. Plot of estimated preference orderings vs. revealed rankings
and PAV fit (plot title: Car rankings). MVAcarrankings.xpl


We see that the estimated $\hat{Y}_4 = 5.16$ is below the estimated $\hat{Y}_6 = 5.66$, although the tester
ranked stimulus 4 ($Y_4 = 6$) above stimulus 6 ($Y_6 = 5$), and thus an inconsistency in ranking
the utilities occurs. The monotone transformation $\hat{Z}_k = f(\hat{Y}_k)$ is
introduced to make the relationship in Figure 16.1 monotone. A very simple procedure consists
of averaging the "violators" $\hat{Y}_4$ and $\hat{Y}_6$ to obtain 5.41. The relationship is then monotone
but the model (16.1) may now be violated. The idea is therefore to iterate these two steps.
This procedure is iterated until the stress measure (see Chapter 15)

$$\mathrm{STRESS} = \frac{\sum_{k=1}^{K} (\hat{Z}_k - \hat{Y}_k)^2}{\sum_{k=1}^{K} (\hat{Y}_k - \bar{\hat{Y}})^2} \qquad (16.9)$$

is minimized over $\beta$ and the monotone transformation $f$. The monotone transformation can
be computed by the so called pool-adjacent-violators (PAV) algorithm.
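

One pass of the nonmetric procedure (metric fit, monotone transformation, stress) can be sketched as follows. This is an illustration under assumptions, not the monotone ANOVA routine itself: the function names are hypothetical, only a single pass is shown rather than the full iteration, and the fitted values are recomputed from the car ranking table with exact level means instead of being copied from the text.

```python
def additive_fit(table):
    """Metric (ANOVA) fit: overall mean + row effect + column effect."""
    rows, cols = len(table), len(table[0])
    mu = sum(sum(r) for r in table) / (rows * cols)
    row_eff = [sum(r) / cols - mu for r in table]
    col_eff = [sum(table[i][j] for i in range(rows)) / rows - mu
               for j in range(cols)]
    return [mu + a + b for a in row_eff for b in col_eff]  # stimulus order


def pav(values):
    """Pool-adjacent-violators fit to values listed in rank order."""
    blocks = []  # [sum, count] per block
    for v in values:
        blocks.append([v, 1])
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return [s / c for s, c in blocks for _ in range(c)]


def monotone_transform(y_obs, y_fit):
    """Z_k = f(Y_hat_k): PAV applied to fitted values sorted by observed rank."""
    order = sorted(range(len(y_obs)), key=lambda k: y_obs[k])
    pooled = pav([y_fit[k] for k in order])
    z = [0.0] * len(y_obs)
    for pos, k in enumerate(order):
        z[k] = pooled[pos]
    return z


def stress(z, y_fit):
    """Stress measure (16.9)."""
    mean_fit = sum(y_fit) / len(y_fit)
    return (sum((a - b) ** 2 for a, b in zip(z, y_fit)) /
            sum((b - mean_fit) ** 2 for b in y_fit))


# Car rankings: rows = engine (50, 70, 90 kW), columns = airbag level (1, 2).
car = [[1, 3], [2, 6], [4, 5]]
y_obs = [v for row in car for v in row]  # (1, 3, 2, 6, 4, 5)
y_fit = additive_fit(car)
z = monotone_transform(y_obs, y_fit)

print([round(v, 2) for v in y_fit])
print([round(v, 2) for v in z])
print(round(stress(z, y_fit), 4))
```

In the full procedure the transformed values would be fed back into the least squares step and the two steps repeated until the stress no longer decreases.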


Summary


↪ The part-worths are estimated via the least squares method.


↪ The metric solution corresponds to analysis of variance in a linear model.


↪ The non-metric solution iterates between a monotone regression curve
fitting and determining the part-worths by ANOVA methodology.


↪ The fitting of data to a monotone function is done via the PAV algorithm.


16.4 Exercises 


EXERCISE 16.1 Compute the part-worths for the following table of rankings


            X_2
            1   2
       1    1   2
X_1    2    [entries illegible in the source]
       3    6   5


EXERCISE 16.2 Consider again Example 16.5. Rewrite the design matrix $X$ and the parameter
vector $\beta$ so that the overall mean effect $\mu$ is part of $X$ and $\beta$, i.e., find the matrix
$X'$ and $\beta'$ such that $Y = X'\beta'$.


EXERCISE 16.3 Compute the design matrix for Example 16.5 for $n = 3$ persons ranking
the margarine with $X_1$ and $X_2$.


EXERCISE 16.4 Construct an analog for Table 16.10 for the car example.


EXERCISE 16.5 Compute the part-worths on the basis of the following tables of rankings
observed on $n = 3$ persons.


[Three tables of rankings of $X_1$ against $X_2$, one per person. The entries are illegible in the source.]

EXERCISE 16.6 Suppose that in the car example a person has ranked cars by the profile 
method on the following characteristics: 


X_1 = motor
X_2 = safety
X_3 = doors


There are $K = 18$ stimuli.


X_1   X_2   X_3   preference
1     1     1     1
1     1     2     3
1     1     3     2
1     2     1     5
1     2     2     4
1     2     3     6
2     1     1     7
2     1     2     8
2     1     3     9
2     2     1     10
2     2     2     12
2     2     3     11
3     1     1     13
3     1     2     15
3     1     3     14
3     2     1     16
3     2     2     17
3     2     3     18

Estimate and analyze the part-worths. 


17 Applications in Finance 


A portfolio is a linear combination of assets. Each asset contributes with a weight $c_j$ to
the portfolio. The performance of such a portfolio is a function of the various returns of
the assets and of the weights $c = (c_1, \ldots, c_p)^{\top}$. In this chapter we investigate the "optimal
choice" of the portfolio weights $c$. The optimality criterion is the mean-variance efficiency
of the portfolio. Investors are usually risk-averse; we can therefore define a mean-variance
efficient portfolio to be a portfolio that has a minimal variance for a given desired mean
return. Equivalently, we could try to optimize the weights for the portfolios with maximal
mean return for a given variance (risk structure). We develop this methodology in the
situations of (non)existence of riskless assets and discuss relations with the Capital Asset
Pricing Model (CAPM).


17.1 Portfolio Choice 


Suppose that one has a portfolio of $p$ assets. The price of asset $j$ at time $i$ is denoted as $p_{ij}$.
The return from asset $j$ in a single time period (day, month, year etc.) is:

$$x_{ij} = \frac{p_{ij} - p_{i-1,j}}{p_{ij}}.$$

We observe the vectors $x_i = (x_{i1},\ldots,x_{ip})^\top$ (i.e., the returns of the assets which are contained
in the portfolio) over several time periods. We stack these observations into a data matrix
$\mathcal{X} = (x_{ij})$ consisting of observations of a random variable

$$X \sim (\mu, \Sigma).$$

The return of the portfolio is the weighted sum of the returns of the $p$ assets:

$$Q = c^\top X, \qquad (17.1)$$

where $c = (c_1,\ldots,c_p)^\top$ (with $\sum_{j=1}^p c_j = 1$) denotes the proportions of the assets in the
portfolio. The mean return of the portfolio is given by the expected value of $Q$, which is
$c^\top\mu$. The risk or volatility of the portfolio is given by the variance of $Q$ (Theorem 4.6), which
is equal to two times

$$\frac{1}{2}\, c^\top \Sigma c. \qquad (17.2)$$

The reason for taking half of the variance of $Q$ is merely technical. The optimization of
(17.2) with respect to $c$ is of course equivalent to minimizing $c^\top\Sigma c$. Our aim is to maximize
the portfolio returns (17.1) given a bound on the volatility (17.2) or vice versa to minimize
risk given a (desired) mean return of the portfolio.
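
As a minimal sketch of (17.1) and (17.2), the following Python fragment computes the mean return $c^\top\mu$ and the variance $c^\top\Sigma c$ of a two-asset portfolio. The values of `mu`, `sigma` and `c` are hypothetical and chosen only for illustration; they are not taken from the data used in this chapter.

```python
# Hypothetical two-asset illustration; all numbers are assumptions.
mu = [0.010, 0.004]                 # assumed mean returns per period
sigma = [[0.0040, 0.0010],
         [0.0010, 0.0100]]          # assumed covariance matrix of the returns
c = [0.7, 0.3]                      # portfolio weights, summing to one


def portfolio_mean(c, mu):
    """Expected portfolio return c' mu."""
    return sum(cj * mj for cj, mj in zip(c, mu))


def portfolio_variance(c, sigma):
    """Portfolio variance c' Sigma c."""
    p = len(c)
    return sum(c[j] * sigma[j][l] * c[l] for j in range(p) for l in range(p))


mean_q = portfolio_mean(c, mu)
var_q = portfolio_variance(c, sigma)
half_var_q = 0.5 * var_q            # the quantity that appears in (17.2)
```

With these assumed inputs the mean return is 0.0082 and the variance is 0.00328; shifting weight towards the second asset raises the variance because that asset has the larger assumed variance.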


Summary


↪ Given a matrix of returns $\mathcal{X}$ from $p$ assets in $n$ time periods, and that
the underlying distribution is stationary, i.e., $X \sim (\mu, \Sigma)$, then the (theoretical)
return of the portfolio is a weighted sum of the returns of the $p$
assets, namely $Q = c^\top X$.


↪ The expected value of $Q$ is $c^\top\mu$. For technical reasons one considers optimizing
$\frac{1}{2} c^\top\Sigma c$. The risk or volatility is $c^\top\Sigma c = \mathrm{Var}(c^\top X)$.


↪ The portfolio choice, i.e., the selection of $c$, is such that the return is
maximized for a given risk bound.


17.2 Efficient Portfolio 


A variance efficient portfolio is one that keeps the risk (17.2) minimal under the constraint
that the weights sum to 1, i.e., $c^\top 1_p = 1$. For a variance efficient portfolio, we therefore try
to find the value of $c$ that minimizes the Lagrangian

$$L = \frac{1}{2}\, c^\top\Sigma c - \lambda(c^\top 1_p - 1). \qquad (17.3)$$


A mean-variance efficient portfolio is defined as one that has minimal variance among all
portfolios with the same mean. More formally, we have to find a vector of weights $c$ such
that the variance of the portfolio is minimal subject to two constraints:

1. a certain, pre-specified mean return $\bar\mu$ has to be achieved,

2. the weights have to sum to one.

Mathematically speaking, we are dealing with an optimization problem under two constraints.


The Lagrangian function for this problem is given by

$$L = c^\top\Sigma c + \lambda_1(\bar\mu - c^\top\mu) + \lambda_2(1 - c^\top 1_p).$$

With tools presented in Section 2.4 we can calculate the first order condition for a minimum:

$$\frac{\partial L}{\partial c} = 2\Sigma c - \lambda_1\mu - \lambda_2 1_p = 0. \qquad (17.4)$$


Figure 17.1. Returns of six firms from January 1978 to December 1987.
MVAreturns.xpl


EXAMPLE 17.1 Figure 17.1 shows the returns from January 1978 to December 1987 of six
stocks traded on the New York Stock Exchange (Berndt, 1990). For each stock we have chosen
the same scale on the vertical axis (which gives the return of the stock). Note how the returns
of some stocks, such as Pan American Airways and Delta Airlines, are much more volatile
than the returns of other stocks, such as IBM or Consolidated Edison (Electric utilities).


As a very simple example consider two differently weighted portfolios containing only two
assets, IBM and PanAm. Figure 17.2 displays the monthly returns of the two portfolios.
The portfolio in the upper panel consists of approximately 10% PanAm assets and 90% IBM
assets. The portfolio in the lower panel contains an equal proportion of each of the assets.
The text windows on the right of Figure 17.2 show the exact weights which were used. We
can clearly see that the returns of the portfolio with a higher share of the IBM assets (which
have a low variance) are much less volatile.


Figure 17.2. Portfolio of IBM and PanAm assets, equal and efficient weights.
Weights shown in the panels: equally weighted portfolio (0.500 IBM, 0.500 Pan Am);
efficiently weighted portfolio (0.896 IBM, 0.104 Pan Am). MVAportfol.xpl


For an exact analysis of the optimization problem (17.4) we distinguish between two cases:
the existence and nonexistence of a riskless asset. A riskless asset is an asset such as a
zero bond, i.e., a financial instrument with a fixed nonrandom return (Franke, Härdle and
Hafner, 2001).


Nonexistence of a riskless asset


Assume that the covariance matrix $\Sigma$ is invertible (which implies positive definiteness). This
is equivalent to the nonexistence of a portfolio $c$ with variance $c^\top\Sigma c = 0$. If all assets are
uncorrelated, $\Sigma$ is invertible if all of the asset returns have positive variances. A riskless asset
(uncorrelated with all other assets) would have zero variance since it has fixed, nonrandom
returns. In this case $\Sigma$ would not be positive definite.


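The optimal weights can be derived from the first order condition (17.4) as

$$c = \frac{1}{2}\Sigma^{-1}(\lambda_1\mu + \lambda_2 1_p). \qquad (17.5)$$

Multiplying this from the left by the transpose of the $(p \times 1)$ vector $1_p$ of ones, we obtain

$$1 = 1_p^\top c = \frac{1}{2}\, 1_p^\top\Sigma^{-1}(\lambda_1\mu + \lambda_2 1_p),$$

which can be solved for $\lambda_2$ to get:

$$\lambda_2 = \frac{2 - \lambda_1 1_p^\top\Sigma^{-1}\mu}{1_p^\top\Sigma^{-1}1_p}.$$

Plugging this expression into (17.5) yields

$$c = \frac{1}{2}\lambda_1\left(\Sigma^{-1}\mu - \frac{1_p^\top\Sigma^{-1}\mu}{1_p^\top\Sigma^{-1}1_p}\,\Sigma^{-1}1_p\right) + \frac{\Sigma^{-1}1_p}{1_p^\top\Sigma^{-1}1_p}. \qquad (17.6)$$

For the case of a variance efficient portfolio there is no restriction on the mean of the portfolio
($\lambda_1 = 0$). The optimal weights are therefore:

$$c = \frac{\Sigma^{-1}1_p}{1_p^\top\Sigma^{-1}1_p}. \qquad (17.7)$$

This formula is identical to the solution of (17.3). Indeed, differentiation with respect to $c$
gives

$$\Sigma c = \lambda 1_p, \qquad c = \lambda\Sigma^{-1}1_p.$$

If we plug this into (17.3), we obtain

$$L = \frac{1}{2}\lambda^2 1_p^\top\Sigma^{-1}1_p - \lambda(\lambda 1_p^\top\Sigma^{-1}1_p - 1) = \lambda - \frac{1}{2}\lambda^2 1_p^\top\Sigma^{-1}1_p.$$

This quantity is a function of $\lambda$ whose derivative vanishes for

$$\lambda = (1_p^\top\Sigma^{-1}1_p)^{-1},$$

and the corresponding $c$ minimizes the variance since

$$\frac{\partial^2 L}{\partial c\,\partial c^\top} = \Sigma > 0.$$


THEOREM 17.1 The variance efficient portfolio weights for returns $X \sim (\mu,\Sigma)$ are

$$c_{opt} = \frac{\Sigma^{-1}1_p}{1_p^\top\Sigma^{-1}1_p}. \qquad (17.8)$$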

Existence of a riskless asset


If an asset exists with variance equal to zero, then the covariance matrix $\Sigma$ is not invertible.
The notation can be adjusted for this case as follows: denote the return of the riskless asset
by $r$ (under the absence of arbitrage this is the interest rate), and partition the vector and
the covariance matrix of returns such that the last component is the riskless asset. Thus,
the last equation of the system (17.4) becomes

$$2\,\mathrm{Cov}(r, X) - \lambda_1 r - \lambda_2 = 0,$$

and, because the covariance of the riskless asset with any portfolio is zero, we have

$$\lambda_2 = -r\lambda_1. \qquad (17.9)$$


Let us for a moment modify the notation in such a way that in each vector and matrix the
components corresponding to the riskless asset are excluded. For example, $c$ is the weight
vector of the risky assets (i.e., assets with positive variance), and $c_0$ denotes the proportion
invested in the riskless asset. Obviously, $c_0 = 1 - 1_p^\top c$, and $\Sigma$, the covariance matrix of the
risky assets, is assumed to be invertible. Solving (17.4) using (17.9) gives

$$c = \frac{\lambda_1}{2}\,\Sigma^{-1}(\mu - r1_p). \qquad (17.10)$$


This equation may be solved for $\lambda_1$ by plugging it into the condition $\mu^\top c = \bar\mu$. This is the
mean-variance efficient weight vector of the risky assets if a riskless asset exists. The final
solution is:

$$c = \frac{\bar\mu\,\Sigma^{-1}(\mu - r1_p)}{\mu^\top\Sigma^{-1}(\mu - r1_p)}. \qquad (17.11)$$

The variance optimal weighting of the assets in the portfolio depends on the structure of the
covariance matrix as the following corollaries show.


COROLLARY 17.1 A portfolio of uncorrelated assets whose returns have equal variances
($\Sigma = \sigma^2\mathcal{I}_p$) needs to be weighted equally:

$$c_{opt} = p^{-1}1_p.$$

Proof:
Here we obtain $1_p^\top\Sigma^{-1}1_p = \sigma^{-2}1_p^\top 1_p = \sigma^{-2}p$ and therefore
$c = \dfrac{\sigma^{-2}1_p}{\sigma^{-2}p} = p^{-1}1_p$.


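COROLLARY 17.2 A portfolio of correlated assets whose returns have equal variances, i.e.,

$$\Sigma = \sigma^2\begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix},$$

needs to be weighted equally:

$$c_{opt} = p^{-1}1_p.$$


Proof:
$\Sigma$ can be rewritten as $\Sigma = \sigma^2\{(1-\rho)\mathcal{I}_p + \rho 1_p 1_p^\top\}$. The inverse is

$$\Sigma^{-1} = \frac{\mathcal{I}_p}{\sigma^2(1-\rho)} - \frac{\rho 1_p 1_p^\top}{\sigma^2(1-\rho)\{1+(p-1)\rho\}}$$

since for a $(p \times p)$ matrix $\mathcal{A}$ of the form $\mathcal{A} = (a-b)\mathcal{I}_p + b1_p1_p^\top$ the inverse is generally given
by

$$\mathcal{A}^{-1} = \frac{\mathcal{I}_p}{(a-b)} - \frac{b1_p1_p^\top}{(a-b)\{a+(p-1)b\}}.$$

Hence

$$\Sigma^{-1}1_p = \frac{1_p}{\sigma^2(1-\rho)} - \frac{\rho 1_p1_p^\top 1_p}{\sigma^2(1-\rho)\{1+(p-1)\rho\}}
= \frac{[\{1+(p-1)\rho\} - \rho p]1_p}{\sigma^2(1-\rho)\{1+(p-1)\rho\}}
= \frac{\{1-\rho\}1_p}{\sigma^2(1-\rho)\{1+(p-1)\rho\}}
= \frac{1_p}{\sigma^2\{1+(p-1)\rho\}}$$

which yields

$$1_p^\top\Sigma^{-1}1_p = \frac{p}{\sigma^2\{1+(p-1)\rho\}}$$

and thus $c = p^{-1}1_p$.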
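
The inverse used in the proof of Corollary 17.2 can be checked numerically. The following sketch builds the equicorrelation covariance matrix and the claimed inverse, multiplies them, and derives the weights $\Sigma^{-1}1_p/(1_p^\top\Sigma^{-1}1_p)$. The values $p = 4$, $\sigma^2 = 0.01$ and $\rho = 0.3$ are arbitrary assumptions.

```python
def equicorrelation_cov(p, sigma2, rho):
    """Sigma = sigma^2 {(1 - rho) I_p + rho 1_p 1_p'}."""
    return [[sigma2 if i == j else sigma2 * rho for j in range(p)]
            for i in range(p)]


def equicorrelation_inverse(p, sigma2, rho):
    """Inverse claimed in the proof of Corollary 17.2."""
    diag = 1.0 / (sigma2 * (1.0 - rho))
    common = rho / (sigma2 * (1.0 - rho) * (1.0 + (p - 1) * rho))
    return [[(diag if i == j else 0.0) - common for j in range(p)]
            for i in range(p)]


def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


p, sigma2, rho = 4, 0.01, 0.3       # arbitrary illustrative values
cov = equicorrelation_cov(p, sigma2, rho)
inv = equicorrelation_inverse(p, sigma2, rho)
product = matmul(cov, inv)          # should be numerically the identity

row_sums = [sum(row) for row in inv]            # Sigma^{-1} 1_p
weights = [v / sum(row_sums) for v in row_sums] # equal weights 1/p
```

Up to floating point error `product` is the identity matrix and every entry of `weights` equals $1/p$, in line with the corollary.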


Let us now consider assets with different variances. We will see that in this case the weights
are adjusted to the risk.


COROLLARY 17.3 A portfolio of uncorrelated assets with returns of different variances,
i.e., $\Sigma = \mathrm{diag}(\sigma_1^2,\ldots,\sigma_p^2)$, has the following optimal weights

$$c_{j,opt} = \frac{\sigma_j^{-2}}{\sum_{l=1}^p \sigma_l^{-2}}, \qquad j = 1,\ldots,p.$$

Proof:
From $\Sigma^{-1} = \mathrm{diag}(\sigma_1^{-2},\ldots,\sigma_p^{-2})$ we have $1_p^\top\Sigma^{-1}1_p = \sum_{l=1}^p\sigma_l^{-2}$ and therefore the optimal
weights are $c_j = \sigma_j^{-2}\big/\sum_{l=1}^p\sigma_l^{-2}$.
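
Corollary 17.3 amounts to inverse-variance weighting, which takes only a few lines to compute. The variances in the sketch below are hypothetical.

```python
def inverse_variance_weights(variances):
    """Weights sigma_j^{-2} / sum_l sigma_l^{-2} for uncorrelated assets."""
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    return [x / total for x in inverse]


# Hypothetical variances of three uncorrelated assets.
weights = inverse_variance_weights([0.01, 0.02, 0.04])
```

Here the inverse variances are 100, 50 and 25, so the weights are 4/7, 2/7 and 1/7: the asset with the smallest variance receives the largest share.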


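This result can be generalized for covariance matrices with block structures.


COROLLARY 17.4 A portfolio of assets with returns $X \sim (\mu,\Sigma)$, where the covariance
matrix has the form:

$$\Sigma = \begin{pmatrix}\Sigma_1 & 0 & \cdots & 0\\ 0 & \Sigma_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & \Sigma_r\end{pmatrix}$$

has optimal weights $c = (c_1^\top,\ldots,c_r^\top)^\top$ given by

$$c_{j,opt} = \frac{\Sigma_j^{-1}1}{\sum_{l=1}^r 1^\top\Sigma_l^{-1}1}, \qquad j = 1,\ldots,r.$$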

Summary


↪ An efficient portfolio is one that keeps the risk minimal under the constraint
that a given mean return is achieved and that the weights sum to
1, i.e., that minimizes $L = c^\top\Sigma c + \lambda_1(\bar\mu - c^\top\mu) + \lambda_2(1 - c^\top 1_p)$.


↪ If a riskless asset does not exist, the variance efficient portfolio weights
are given by

$$c = \frac{\Sigma^{-1}1_p}{1_p^\top\Sigma^{-1}1_p}.$$


↪ If a riskless asset exists, the mean-variance efficient portfolio weights are
given by

$$c = \frac{\bar\mu\,\Sigma^{-1}(\mu - r1_p)}{\mu^\top\Sigma^{-1}(\mu - r1_p)}.$$


↪ The efficient weighting depends on the structure of the covariance matrix
$\Sigma$. Equal variances of the assets in the portfolio lead to equal weights,
different variances lead to weightings proportional to the inverses of these variances:

$$c_{j,opt} = \frac{\sigma_j^{-2}}{\sum_{l=1}^p\sigma_l^{-2}}, \qquad j = 1,\ldots,p.$$


17.3 Efficient Portfolios in Practice


We can now demonstrate the usefulness of this technique by applying our method to the
monthly market returns computed on the basis of transactions at the New York stock market
between January 1978 and December 1987 (Berndt, 1990).


EXAMPLE 17.2 Recall that we had shown the portfolio returns with uniform and optimal
weights in Figure 17.2. The covariance matrix of the returns of IBM and PanAm is

$$\mathcal{S} = \begin{pmatrix} 0.0034 & 0.0016 \\ 0.0016 & 0.0172 \end{pmatrix}.$$

Hence by (17.7) the optimal weighting is

$$\hat c = \frac{\mathcal{S}^{-1}1_2}{1_2^\top\mathcal{S}^{-1}1_2} = (0.8957, 0.1043)^\top.$$
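
The calculation in Example 17.2 can be retraced by hand for a $2\times 2$ matrix, since $\mathcal{S}^{-1}1_2$ is proportional to $(s_{22} - s_{12},\ s_{11} - s_{12})^\top$ and the determinant cancels in the ratio. The sketch below uses the covariance entries as printed in the example; the function name is an assumption made for illustration.

```python
# Covariance matrix of the IBM and PanAm returns as printed in Example 17.2.
s = [[0.0034, 0.0016],
     [0.0016, 0.0172]]


def variance_efficient_weights_2x2(s):
    """Weights S^{-1} 1 / (1' S^{-1} 1) for a 2x2 covariance matrix."""
    (s11, s12), (s21, s22) = s
    # S^{-1} 1 is proportional to (s22 - s12, s11 - s21); the determinant cancels.
    raw = [s22 - s12, s11 - s21]
    total = sum(raw)
    return [w / total for w in raw]


weights = variance_efficient_weights_2x2(s)
```

With the printed four-decimal entries the weights come out at roughly (0.897, 0.103). This is close to, but not identical with, the reported (0.8957, 0.1043), presumably because the reported weights were computed from unrounded covariances.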


The effect of efficient weighting becomes even clearer when we expand the portfolio to six
assets. The covariance matrix for the returns of all six firms introduced in Example 17.1 is

$$\mathcal{S} = \begin{pmatrix}
0.0035 & 0.0016 & 0.0019 & 0.0003 & 0.0015 & 0.0010 \\
0.0016 & 0.0172 & 0.0049 & 0.0011 & 0.0019 & 0.0003 \\
0.0019 & 0.0049 & 0.0091 & 0.0004 & 0.0016 & 0.0010 \\
0.0003 & 0.0011 & 0.0004 & 0.0025 & 0.0007 & -0.0004 \\
0.0015 & 0.0019 & 0.0016 & 0.0007 & 0.0076 & 0.0021 \\
0.0010 & 0.0003 & 0.0010 & -0.0004 & 0.0021 & 0.0063
\end{pmatrix}.$$

Hence the optimal weighting is

$$\hat c = \frac{\mathcal{S}^{-1}1_6}{1_6^\top\mathcal{S}^{-1}1_6} = (0.2504, 0.0039, 0.0409, 0.5087, 0.0072, 0.1890)^\top.$$


Figure 17.3. Portfolio of all six assets, equal and efficient weights.
MVAportfol.xpl
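
For more than two assets the weights $\mathcal{S}^{-1}1_p/(1_p^\top\mathcal{S}^{-1}1_p)$ are most easily obtained by solving the linear system $\mathcal{S}x = 1_p$ and normalizing $x$. The following sketch does so with a plain Gaussian elimination, applied to the six-asset covariance matrix as printed above. The solver and the function names are illustrative assumptions; in practice a linear algebra library would be used instead.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def variance_efficient_weights(cov):
    """Weights S^{-1} 1_p / (1_p' S^{-1} 1_p)."""
    x = solve_linear_system(cov, [1.0] * len(cov))   # S^{-1} 1_p
    total = sum(x)                                   # 1_p' S^{-1} 1_p
    return [xi / total for xi in x]


# Six-asset covariance matrix as printed in the text (rounded entries).
s = [
    [0.0035, 0.0016, 0.0019, 0.0003, 0.0015, 0.0010],
    [0.0016, 0.0172, 0.0049, 0.0011, 0.0019, 0.0003],
    [0.0019, 0.0049, 0.0091, 0.0004, 0.0016, 0.0010],
    [0.0003, 0.0011, 0.0004, 0.0025, 0.0007, -0.0004],
    [0.0015, 0.0019, 0.0016, 0.0007, 0.0076, 0.0021],
    [0.0010, 0.0003, 0.0010, -0.0004, 0.0021, 0.0063],
]

weights = variance_efficient_weights(s)
# For the efficient weights, S c is a constant vector (first order condition).
s_times_c = [sum(sij * cj for sij, cj in zip(row, weights)) for row in s]
```

Because the printed covariances are rounded to four decimals, the computed weights should be expected to agree with the reported ones only approximately. The vector `s_times_c` has identical entries, which is the first order condition $\mathcal{S}c = \lambda 1_p$ from Section 17.2.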


As we can clearly see, the optimal weights are quite different from the equal weights ($c_j = 1/6$).
The weights which were used are shown in text windows on the right hand side of
Figure 17.3.


This efficient weighting assumes stable covariances between the assets over time. Changing 
covariance structure over time implies weights that depend on time as well. This is part of a 
large body of literature on multivariate volatility models. For a review refer to Franke et al. 
(2001). 


Summary


↪ Efficient portfolio weighting in practice consists of estimating the covariances
of the assets in the portfolio and then computing efficient weights
from this empirical covariance matrix.


↪ Note that this efficient weighting assumes stable covariances between the
assets over time.


17.4 The Capital Asset Pricing Model (CAPM) 


The CAPM considers the relation between a mean-variance efficient portfolio and an asset
uncorrelated with this portfolio. Let us denote this specific asset return by $y_0$. The riskless
asset with constant return $y_0 = r$ may be such an asset. Recall from (17.4) the condition for
a mean-variance efficient portfolio:

$$2\Sigma c - \lambda_1\mu - \lambda_2 1_p = 0.$$

In order to eliminate $\lambda_2$, we can multiply (17.4) by $c^\top$ and use $c^\top\mu = \bar\mu$ and $c^\top 1_p = 1$ to get:

$$2c^\top\Sigma c - \lambda_1\bar\mu = \lambda_2.$$

Plugging this into (17.4), we obtain:

$$2\Sigma c - \lambda_1\mu = 2c^\top\Sigma c\,1_p - \lambda_1\bar\mu 1_p$$

$$\mu = \bar\mu 1_p + \frac{2}{\lambda_1}(\Sigma c - c^\top\Sigma c\,1_p). \qquad (17.12)$$

For the asset that is uncorrelated with the portfolio, equation (17.12) can be written as:

$$y_0 = \bar\mu - \frac{2}{\lambda_1}\,c^\top\Sigma c$$

since $y_0 = r$ is the mean return of this asset and is otherwise uncorrelated with the risky
assets. This yields:

$$\lambda_1 = 2\,\frac{c^\top\Sigma c}{\bar\mu - y_0} \qquad (17.13)$$

and if (17.13) is plugged into (17.12):

$$\mu = \bar\mu 1_p + \frac{\bar\mu - y_0}{c^\top\Sigma c}(\Sigma c - c^\top\Sigma c\,1_p)$$

$$\mu = y_0 1_p + \frac{\Sigma c}{c^\top\Sigma c}(\bar\mu - y_0)$$

$$\mu = y_0 1_p + \beta(\bar\mu - y_0) \qquad (17.14)$$

with

$$\beta = \frac{\Sigma c}{c^\top\Sigma c}.$$


The relation (17.14) holds if there exists any asset that is uncorrelated with the mean-variance
efficient portfolio $c$. The existence of a riskless asset is not a necessary condition
for deriving (17.14). However, for this special case we arrive at the well-known expression

$$\mu = r1_p + \beta(\bar\mu - r), \qquad (17.15)$$

which is known as the Capital Asset Pricing Model (CAPM), see Franke et al. (2001). The
beta factor $\beta$ measures the relative performance with respect to riskless assets or an index.
It reflects the sensitivity of an asset with respect to the whole market. The beta factor is
close to 1 for most assets. A factor of 1.16, for example, means that the asset reacts in
relation to movements of the whole market (expressed through an index like DAX or DOW
JONES) 16 percent more strongly than the index. This is of course true for both positive and
negative fluctuations of the whole market.
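
To make the definition $\beta = \Sigma c/(c^\top\Sigma c)$ concrete, the sketch below evaluates it for the two-asset covariance matrix of Example 17.2. The equally weighted reference portfolio `c` is an assumption made purely to show the arithmetic; it is not claimed to be mean-variance efficient.

```python
# Covariance matrix of IBM and PanAm returns as printed in Example 17.2.
sigma = [[0.0034, 0.0016],
         [0.0016, 0.0172]]
c = [0.5, 0.5]     # assumed reference portfolio (equal weights)


def beta_factors(sigma, c):
    """Vector beta = Sigma c / (c' Sigma c)."""
    sigma_c = [sum(sij * cj for sij, cj in zip(row, c)) for row in sigma]
    portfolio_variance = sum(ci * si for ci, si in zip(c, sigma_c))
    return [si / portfolio_variance for si in sigma_c]


beta = beta_factors(sigma, c)
weighted_beta = sum(ci * bi for ci, bi in zip(c, beta))   # always equals 1
```

For this reference portfolio the first asset (IBM) has a beta of about 0.42 and the second (PanAm) a beta of about 1.58, so PanAm reacts more strongly than the portfolio itself. The weighted average $c^\top\beta$ is 1 for any choice of $c$, which follows directly from the definition.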


Summary


↪ The weights of the mean-variance efficient portfolio satisfy
$2\Sigma c - \lambda_1\mu - \lambda_2 1_p = 0$.


↪ In the CAPM the mean of $X$ depends on the riskless asset and the pre-specified
mean $\bar\mu$ as follows: $\mu = r1_p + \beta(\bar\mu - r)$.


↪ The beta factor $\beta$ measures the relative performance with respect to riskless
assets or an index and reflects the sensitivity of an asset with respect
to the whole market.


17.5 Exercises 


EXERCISE 17.1 Prove that the inverse of $\mathcal{A} = (a-b)\mathcal{I}_p + b1_p1_p^\top$ is given by

$$\mathcal{A}^{-1} = \frac{\mathcal{I}_p}{(a-b)} - \frac{b1_p1_p^\top}{(a-b)\{a+(p-1)b\}}.$$


EXERCISE 17.2 The empirical covariance between the 120 returns of IBM and PanAm
is 0.0016 (see Example 17.2). Test if the true covariance is zero. Hint: Use Fisher’s Z-transform.


EXERCISE 17.3 Explain why in both Figures 17.2 and 17.3 the portfolios have negative
returns just before the end of the series, regardless of whether they are optimally weighted or
not! (What happened in December 1987?)


EXERCISE 17.4 Apply the method used in Example 17.2 on the same data (Table B.5)
including also the Digital Equipment company. Obviously one of the weights is negative. Is
this an efficient weighting?


EXERCISE 17.5 In the CAPM the $\beta$ value tells us about the performance of the portfolio
relative to the riskless asset. Calculate the $\beta$ value for each single stock price series relative
to the “riskless” asset IBM.


18 Highly Interactive, Computationally Intensive Techniques


It is generally accepted that training in statistics must include some exposure to the mechan- 
ics of computational statistics. This exposure to computational methods is of an essential 
nature when we consider extremely high dimensional data. Computer aided techniques can 
help us discover dependencies in high dimensions without complicated mathematical tools. 
A draftsman’s plot (i.e., a matrix of pairwise scatterplots like in Figure 1.14) may lead us
immediately to a theoretical hypothesis (on a lower dimensional space) about the relation- 
ship of the variables. Computer aided techniques are therefore at the heart of multivariate 
statistical analysis. 


In this chapter we first present the concept of Simplicial Depth—a multivariate extension 
of the data depth concept of Section 1.1. We then present Projection Pursuit—a semipara- 
metric technique which is based on a one-dimensional, flexible regression or on the idea of 
density smoothing applied to PCA type projections. A similar model is underlying the Sliced 
Inverse Regression (SIR) technique which we discuss in Section 18.3. 


18.1 Simplicial Depth 


Simplicial depth generalizes the notion of data depth as introduced in Section 1.1. This
general definition allows us to define a multivariate median and to visually present high
dimensional data in low dimension. For univariate data we have well known parameters of
location which describe the center of a distribution of a random variable $X$. These parameters
are for example the mean

$$\bar x = \frac{1}{n}\sum_{i=1}^n x_i, \qquad (18.1)$$

or the mode

$$x_{mod} = \arg\max_x \hat f(x),$$

where $\hat f$ is the estimated density function of $X$ (see Section 1.3). The median

$$x_{med} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ odd} \\[4pt] \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{otherwise,} \end{cases}$$

where $x_{(i)}$ is the order statistics of the $n$ observations $x_i$, is yet another measure of location.


The first two parameters can be easily extended to multivariate random variables. The mean
in higher dimensions is defined as in (18.1) and the mode accordingly,

$$x_{mod} = \arg\max_x \hat f(x)$$

with $\hat f$ the estimated multidimensional density function of $X$ (see Section 1.3). The median
poses a problem though since in a multivariate sense we cannot interpret the element-wise
median

$$x_{med,j} = \begin{cases} x_{((n+1)/2),j} & \text{if } n \text{ odd} \\[4pt] \dfrac{x_{(n/2),j} + x_{(n/2+1),j}}{2} & \text{otherwise} \end{cases} \qquad (18.2)$$

as a point that is “most central”. The same argument applies to other observations of a
sample that have a certain “depth” as defined in Section 1.1. The “fourths” or the “extremes”
are not defined in a straightforward way in higher dimensions (not even in two).


An equivalent definition of the median in one dimension is given by the simplicial depth. It
is defined as follows: For each pair of datapoints $x_i$ and $x_j$ we generate a closed interval, a
one-dimensional simplex, which contains $x_i$ and $x_j$ as border points. Redefine the median
as the datapoint $x_{med}$, which is enclosed in the maximum number of intervals:

$$x_{med} = \arg\max_i \#\{k, l;\ x_i \in [x_k, x_l]\}. \qquad (18.3)$$

With this definition of the median, the median is the “deepest” and “most central” point in
a data set as discussed in Section 1.1. This definition involves a computationally intensive
operation since we generate $n(n-1)/2$ intervals for $n$ observations.
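
The counting in (18.3) can be written down directly. The sketch below enumerates all $n(n-1)/2$ closed intervals and counts, for each datapoint, how many of them contain it. The data values and the function names are hypothetical.

```python
from itertools import combinations


def interval_depths(data):
    """For each datapoint, count the closed intervals [x_k, x_l] that contain it."""
    depths = []
    for x in data:
        count = sum(1 for a, b in combinations(data, 2)
                    if min(a, b) <= x <= max(a, b))
        depths.append(count)
    return depths


def simplicial_median(data):
    """Datapoint enclosed in the maximum number of intervals, as in (18.3)."""
    depths = interval_depths(data)
    return data[depths.index(max(depths))]


data = [2.0, 9.0, 4.0, 7.0, 5.0]      # hypothetical observations
depths = interval_depths(data)        # [4, 4, 7, 7, 8]
median = simplicial_median(data)      # 5.0
```

For these five values the deepest point is 5.0, which is also the usual median. The double loop makes the quadratic cost in $n$ visible.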


In two dimensions, the computation is even more intensive since the interval $[x_k, x_l]$ is replaced
by a triangle constructed from three different datapoints. The median as the deepest
point is then defined by that datapoint that is covered by the maximum number of triangles.
In three dimensions triangles become pyramids formed from 4 points and the median is that
datapoint that lies in the maximum number of pyramids.


An example for the depth in 2 dimensions is given by the constellation of points given in
Figure 18.1. If we build for example the triangle of the points 1, 3, 5 (denoted as △ 135 in
Table 18.1), it contains the point 4. From Table 18.1 we count the number of coverages to
obtain the simplicial depth values of Table 18.2.


Figure 18.1. Construction of simplicial depth. MVAsimdep1.xpl


In arbitrary dimension $p$, we look for datapoints that lie inside a simplex (or convex hull)
formed from $p+1$ points. We therefore extend the definition of the median to the multivariate
case as follows

$$x_{med} = \arg\max_i \#\{k_0,\ldots,k_p;\ x_i \in \mathrm{hull}(x_{k_0},\ldots,x_{k_p})\}. \qquad (18.4)$$

Here $k_0,\ldots,k_p$ denote the indices of $p+1$ datapoints. Thus for each datapoint we have a
multivariate data depth. If we compute all the necessary simplices $\mathrm{hull}(x_{k_0},\ldots,x_{k_p})$, the
computing time will unfortunately be exponential as the dimension increases.
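
In the bivariate case ($p = 2$) definition (18.4) means counting triangles. The following sketch tests whether a point lies in a closed triangle by checking the signs of three cross products and then counts coverages over all triples. The five points are hypothetical and in general position (no three on a line), which the simple sign test assumes.

```python
from itertools import combinations


def cross(o, a, b):
    """z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_triangle(pt, a, b, c):
    """True if pt lies in the closed triangle abc (non-degenerate triangles assumed)."""
    d1, d2, d3 = cross(a, b, pt), cross(b, c, pt), cross(c, a, pt)
    has_negative = d1 < 0 or d2 < 0 or d3 < 0
    has_positive = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_negative and has_positive)


def simplicial_depths_2d(points):
    """Number of triangles of datapoints covering each datapoint."""
    return [sum(1 for a, b, c in combinations(points, 3)
                if in_triangle(pt, a, b, c))
            for pt in points]


# Hypothetical configuration: four corners of a square and one interior point.
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0), (2.0, 1.0)]
depths = simplicial_depths_2d(points)           # [6, 6, 6, 6, 8]
deepest = points[depths.index(max(depths))]     # (2.0, 1.0)
```

Each point is a vertex of six triangles; the interior point is additionally covered by two triangles formed by the corners, so it is the deepest point. The number of triangles grows like $n^3$, which illustrates why the computation becomes expensive.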


In Figure 18.2 we calculate the simplicial depth for a two-dimensional, 10 point distribution.
The deepest point, the two-dimensional median, is indicated as a big star in the center. The
points with less depth are indicated via grey shades.


      Triangle   Coverages
  1   △ 123      1 2 3
  2   △ 124      1 2 4
  3   △ 125      1 2 5
  4   △ 126      1 2 3 4 6
  5   △ 134      1 3 4
  6   △ 135      1 3 4 5
  7   △ 136      1 3 6
  8   △ 145      1 4 5
  9   △ 146      1 3 4 6
 10   △ 156      1 3 4 5 6
 11   △ 234      2 3 4
 12   △ 235      2 3 4 5
 13   △ 236      2 3 4 6
 14   △ 245      2 4 5
 15   △ 246      2 4 6
 16   △ 256      2 5 6
 17   △ 345      3 4 5
 18   △ 346      3 4 6
 19   △ 356      3 5 6
 20   △ 456      4 5 6


Table 18.1. Coverages for artificial configuration of points.


point |  1   2   3   4   5   6
depth | 10  10  12  14   8   8


Table 18.2. Simplicial depths for artificial configuration of points.


Summary


↪ The “depth” of a datapoint in one dimension can be computed by counting
all (closed) intervals of two datapoints which contain the datapoint.


↪ The “deepest” datapoint is the central point of the distribution, the median.


↪ The “depth” of a datapoint in arbitrary dimension $p$ is defined as the
number of simplices (constructed from $p+1$ points) covering this point.
It is called simplicial depth.


↪ A multivariate extension of the median is to take the “deepest” datapoint
of the distribution.


↪ In the bivariate case we count all triangles of datapoints which contain
the datapoint to compute its depth.


18.2 Projection Pursuit 


“Projection Pursuit” stands for a class of exploratory projection techniques. This class con- 
tains statistical methods designed for analyzing high-dimensional data using low-dimensional 
projections. The aim of projection pursuit is to reveal possible nonlinear and therefore in- 
teresting structures hidden in the high-dimensional data. To what extent these structures 
are “interesting” is measured by an index. Exploratory Projection Pursuit (EPP) goes back 
to Kruskal (1969; 1972). The approach was successfully implemented for exploratory purposes
by various other authors. The idea has been applied to regression analysis, density
estimation, classification and discriminant analysis.


Exploratory Projection Pursuit 


In EPP, we try to find “interesting” low-dimensional projections of the data. For this
purpose, a suitable index function $I(\alpha)$, depending on a normalized projection vector $\alpha$,
is used. This function will be defined such that “interesting” views correspond to local
and global maxima of the function. This approach naturally accompanies the technique of
principal component analysis (PCA) of the covariance structure of a random vector $X$. In
PCA we are interested in finding the axes of the covariance ellipsoid. The index function
$I(\alpha)$ is in this case the variance of a linear combination $\alpha^\top X$ subject to the normalizing
constraint $\alpha^\top\alpha = 1$ (see Theorem 9.2). If we analyze a sample with a $p$-dimensional normal
distribution, the “interesting” high-dimensional structure we find by maximizing this index
is of course linear.


There are many possible projection indices; for simplicity only the kernel based and polynomial
based indices are reported. Assume that the $p$-dimensional random variable $X$ is sphered
and centered, that is, $E(X) = 0$ and $\mathrm{Var}(X) = \mathcal{I}_p$. This will remove the effect of location,
scale, and correlation structure. This covariance structure can be achieved easily by the
Mahalanobis transformation (3.26).


Figure 18.2. 10 point distribution with the median shown as a big star in
the center. MVAsimdepex.xpl


Friedman and Tukey (1974) proposed to investigate the high-dimensional distribution of $X$
by considering the index

$$I_{FT,h}(\alpha) = n^{-1}\sum_{i=1}^n \hat f_{h,\alpha}(\alpha^\top X_i) \qquad (18.5)$$

where $\hat f_{h,\alpha}$ denotes the kernel estimator (see Section 1.3)

$$\hat f_{h,\alpha}(z) = n^{-1}\sum_{j=1}^n K_h(z - \alpha^\top X_j) \qquad (18.6)$$

of the projected data. Note that (18.5) is an estimate of $\int f^2(z)\,dz$ where $z = \alpha^\top X$ is a
one-dimensional random variable with mean zero and unit variance. If the high-dimensional
distribution of $X$ is normal, then each projection $z = \alpha^\top X$ is standard normal since $\|\alpha\| = 1$
and since $X$ has been centered and sphered by, e.g., the Mahalanobis transformation.


The index should therefore be stable as a function of $\alpha$ if the high-dimensional data is
in fact normal. Changes in $I_{FT,h}(\alpha)$ with respect to $\alpha$ therefore indicate deviations from
normality. Hodges and Lehmann (1956) showed that, given a mean of zero and unit variance,
the (compact support) density which minimizes $\int f^2$ is uniquely given by

$$f(z) = \max\{0, c(b^2 - z^2)\},$$

where $c = 3/(20\sqrt{5})$ and $b = \sqrt{5}$. This is a parabolic density function, which is equal to
zero outside the interval $(-\sqrt{5}, \sqrt{5})$. A high value of the Friedman-Tukey index indicates a
larger departure from the parabolic form.
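
The index (18.5) with the kernel estimator (18.6) can be sketched in a few lines. The code below assumes a Gaussian kernel, a fixed bandwidth `h`, and data that are already centered and sphered; these choices, the small artificial data set, and the function names are assumptions made for illustration.

```python
import math


def gaussian_kernel(u, h):
    """K_h(u) = K(u / h) / h with K the standard normal density."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def project(data, alpha):
    """One-dimensional projections alpha' x_i of the rows of data."""
    return [sum(a * x for a, x in zip(alpha, row)) for row in data]


def friedman_tukey_index(data, alpha, h):
    """Average of the kernel density estimate at the projected points, as in (18.5)."""
    z = project(data, alpha)
    n = len(z)
    density_at_points = [sum(gaussian_kernel(zi - zj, h) for zj in z) / n
                         for zi in z]
    return sum(density_at_points) / n


# Artificial sphered data: two tight clusters along the first axis,
# evenly spread values along the second axis.
a = math.sqrt(1.5)
data = [(-1.0, -a), (-1.0, 0.0), (-1.0, a), (1.0, -a), (1.0, 0.0), (1.0, a)]

index_clustered = friedman_tukey_index(data, (1.0, 0.0), h=0.5)
index_spread = friedman_tukey_index(data, (0.0, 1.0), h=0.5)
```

The projection onto the first axis, where the data fall into two clusters, gets the higher index value, so this direction would be regarded as the more “interesting” one. The result depends on the bandwidth, which is not tuned here.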


An alternative index is based on the negative of the entropy measure $\int -f\log f$, i.e., on
$\int f\log f$. The density for zero mean and unit variance which minimizes the index

$$\int f\log f$$

is the standard normal density, a far more plausible candidate than the parabolic density as
a norm from which departure is to be regarded as “interesting”. Thus in using $\int f\log f$ as a
projection index we are really implementing the viewpoint of seeing “interesting” projections
as departures from normality. Yet another index could be based on the Fisher information
(see Section 6.2)

$$\int (f')^2/f.$$

To optimize the entropy index, it is necessary to recalculate it at each step of the numerical
procedure. There is no method of obtaining the index via summary statistics of the multivariate
data set, so the workload of the calculation at each iteration is determined by the
number of observations. It is therefore interesting to look for approximations to the entropy
index. Jones and Sibson (1987) suggested that deviations from the normal density should
be considered as

$$f(x) = \varphi(x)\{1 + \varepsilon(x)\} \qquad (18.7)$$

where the function $\varepsilon$ satisfies

$$\int \varphi(u)\varepsilon(u)u^{r}\,du = 0, \quad \text{for } r = 0, 1, 2. \qquad (18.8)$$


In order to develop the Jones and Sibson index it is convenient to think in terms of cumulants
$\kappa_3 = \mu_3 = E(X^3)$, $\kappa_4 = \mu_4 = E(X^4) - 3$ (see Section 4.2). The standard normal density
satisfies $\kappa_3 = \kappa_4 = 0$; an index with any hope of tracking the entropy index must at least
incorporate information up to the level of symmetric departures ($\kappa_3$ or $\kappa_4$ not zero) from
normality. The simplest of such indices is a positive definite quadratic form in $\kappa_3$ and $\kappa_4$. It
must be invariant under sign-reversal of the data since both $\alpha^\top X$ and $-\alpha^\top X$ should show
the same kind of departure from normality. Note that $\kappa_3$ is odd under sign-reversal, i.e.,
$\kappa_3(\alpha^\top X) = -\kappa_3(-\alpha^\top X)$. The cumulant $\kappa_4$ is even under sign-reversal, i.e.,
$\kappa_4(\alpha^\top X) = \kappa_4(-\alpha^\top X)$. The quadratic form in $\kappa_3$ and $\kappa_4$ measuring departure from normality
therefore cannot include a mixed $\kappa_3\kappa_4$ term.


For the density (18.7) one may conclude with (18.8) that, up to the additive constant
$\int\varphi(u)\log\varphi(u)\,du$,

$$\int f(u)\log f(u)\,du \approx \frac{1}{2}\int\varepsilon^2(u)\varphi(u)\,du.$$

Now if $f$ is expressed as a Gram-Charlier expansion

$$f(x) = \varphi(x)\{1 + \kappa_3H_3(x)/6 + \kappa_4H_4(x)/24 + \cdots\} \qquad (18.9)$$

(Kendall and Stuart, 1977, p. 169) where $H_r$ is the $r$-th Hermite polynomial, then the truncation
of (18.9) and use of orthogonality and normalization properties of Hermite polynomials
with respect to $\varphi$ yields

$$\frac{1}{2}\int\varphi(x)\varepsilon^2(x)\,dx = (\kappa_3^2 + \kappa_4^2/4)/12.$$

The index proposed by Jones and Sibson (1987) is therefore

$$I_{JS}(\alpha) = \{\kappa_3^2(\alpha^\top X) + \kappa_4^2(\alpha^\top X)/4\}/12.$$

This index measures in fact the difference $\int f\log f - \int\varphi\log\varphi$.
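
A sample version of the Jones and Sibson index replaces $\kappa_3$ and $\kappa_4$ by the third moment and the excess fourth moment of the standardized projected data. The sketch below is one straightforward way to do this; the standardization with divisor $n$ and the example values are assumptions.

```python
import math


def standardize(z):
    """Center and scale to mean zero and unit variance (divisor n)."""
    n = len(z)
    mean = sum(z) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in z) / n)
    return [(v - mean) / sd for v in z]


def jones_sibson_index(z):
    """Sample version of {kappa_3^2 + kappa_4^2 / 4} / 12 for projected data z."""
    s = standardize(z)
    n = len(s)
    kappa3 = sum(v ** 3 for v in s) / n
    kappa4 = sum(v ** 4 for v in s) / n - 3.0
    return (kappa3 ** 2 + kappa4 ** 2 / 4.0) / 12.0


# Hypothetical projected data: a symmetric two-point distribution.
z = [-1.0, 1.0, -1.0, 1.0]
index = jones_sibson_index(z)      # kappa_3 = 0, kappa_4 = -2, index = 1/12
```

For the two-point example the skewness term vanishes and the whole index comes from the fourth cumulant. Unlike the kernel based index, no bandwidth has to be chosen, and the cost per projection is linear in $n$.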
EXAMPLE 18.1 The exploratory Projection Pursuit is used on the Swiss bank note data.
For 50 randomly chosen one-dimensional projections of this six-dimensional dataset we calculate
the Friedman-Tukey index to evaluate how “interesting” their structures are.


Figure 18.3 shows the density for the standard, normally distributed data (green) and the
estimated densities for the best (red) and the worst (blue) projections found. A dotplot of
the projections is also presented. In the lower part of the figure we see the estimated value
of the Friedman-Tukey index for each computed projection. From this information we can
judge the non-normality of the bank note data set since there is a lot of variation across the
50 random projections.


Figure 18.3. Exploratory Projection Pursuit for the Swiss bank
notes data (green = standard normal, red = best, blue = worst).
MVAppexample.xpl


Projection Pursuit Regression


The problem in projection pursuit regression is to estimate a response surface

$$f(x) = E(Y \mid x)$$

via approximating functions of the form

$$\hat f(x) = \sum_{k=1}^M g_k(\Lambda_k^\top x)$$

with non-parametric regression functions $g_k$. Given observations $\{(x_1, y_1),\ldots,(x_n, y_n)\}$ with
$x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$ the basic algorithm works as follows.


1. Set $r_i^{(0)} = y_i$ and $k = 1$.


2. Minimize

$$E_k = \sum_{i=1}^n \{r_i^{(k-1)} - g_k(\Lambda_k^\top x_i)\}^2$$

where $\Lambda_k$ is an orthogonal projection matrix and $g_k$ is a non-parametric regression
estimator.


3. Compute new residuals

$$r_i^{(k)} = r_i^{(k-1)} - g_k(\Lambda_k^\top x_i).$$


4. Increase $k$ and repeat the last two steps until $E_k$ becomes small.


Although this approach seems to be simple, we encounter some problems. One of the most
serious is that the decomposition of a function into sums of functions of projections may not
be unique. An example is

$$z_1z_2 = \frac{1}{4ab}\{(az_1 + bz_2)^2 - (az_1 - bz_2)^2\}.$$


Improvements of this algorithm were suggested by Friedman and Stuetzle (1981).
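
The four steps of the basic algorithm can be sketched as follows. This is a deliberately simplified illustration under several assumptions that are not part of the text: the projections are one-dimensional directions in $\mathbb{R}^2$ found by a grid search over angles, $g_k$ is a Nadaraya-Watson smoother with a Gaussian kernel and fixed bandwidth `h`, and the number of terms is fixed in advance instead of stopping when $E_k$ becomes small.

```python
import math


def nw_smooth(t, r, h):
    """Nadaraya-Watson estimate of E(r | t) at the observed t (Gaussian kernel)."""
    fitted = []
    for ti in t:
        w = [math.exp(-0.5 * ((ti - tj) / h) ** 2) for tj in t]
        fitted.append(sum(wj * rj for wj, rj in zip(w, r)) / sum(w))
    return fitted


def fit_ridge_term(x, r, h, n_angles=36):
    """Step 2: grid search over unit directions in R^2 for the smallest E_k."""
    best = None
    for k in range(n_angles):
        angle = math.pi * k / n_angles
        direction = (math.cos(angle), math.sin(angle))
        t = [direction[0] * xi[0] + direction[1] * xi[1] for xi in x]
        fitted = nw_smooth(t, r, h)
        error = sum((ri - fi) ** 2 for ri, fi in zip(r, fitted))
        if best is None or error < best[0]:
            best = (error, direction, fitted)
    return best


def projection_pursuit_regression(x, y, n_terms=2, h=0.1):
    residuals = list(y)                                   # step 1
    directions, errors = [], []
    for _ in range(n_terms):                              # step 4 (fixed count)
        error, direction, fitted = fit_ridge_term(x, residuals, h)   # step 2
        residuals = [ri - fi for ri, fi in zip(residuals, fitted)]   # step 3
        directions.append(direction)
        errors.append(error)
    return directions, errors


# Artificial data: the response depends on x only through x1 + x2.
x = [(a, b) for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)]
y = [(a + b) ** 2 for a, b in x]
directions, errors = projection_pursuit_regression(x, y)
```

Since the artificial response is an exact function of $x_1 + x_2$, a single term along the 45 degree direction already reduces $E_1$ to practically zero. The small bandwidth makes the smoother nearly interpolate, which works on this noise-free toy example but would overfit noisy data.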


Summary


↪ Exploratory Projection Pursuit is a technique used to find interesting
structures in high-dimensional data via low-dimensional projections. Since
the Gaussian distribution represents a standard situation, we define the
Gaussian distribution as the most uninteresting.


↪ The search for interesting structures is done via a projection score like the
Friedman-Tukey index $I_{FT}(\alpha) = \int f^2$. The parabolic distribution has the
minimal score. We maximize this score over all projections.


↪ The Jones-Sibson index maximizes

$$I_{JS}(\alpha) = \{\kappa_3^2(\alpha^\top X) + \kappa_4^2(\alpha^\top X)/4\}/12$$

as a function of $\alpha$.


↪ The entropy index maximizes

$$I_E(\alpha) = \int f\log f$$

where $f$ is the density of $\alpha^\top X$.


↪ In Projection Pursuit Regression the idea is to represent the unknown
function by a sum of non-parametric regression functions on projections.
The key problems are choosing the number of terms and, often, the
interpretability.


18.3 Sliced Inverse Regression 


Sliced inverse regression (SIR) is a dimension reduction method proposed by Duan and Li
(1991). The idea is to find a smooth regression function that operates on a variable set of
projections. Given a response variable $Y$ and a (random) vector $X \in \mathbb{R}^p$ of explanatory
variables, SIR is based on the model:

$$Y = m(\beta_1^\top X,\ldots,\beta_k^\top X, \varepsilon), \qquad (18.10)$$

where $\beta_1,\ldots,\beta_k$ are unknown projection vectors, $k$ is unknown and assumed to be less
than $p$, $m: \mathbb{R}^{k+1} \to \mathbb{R}$ is an unknown function, and $\varepsilon$ is the noise random variable with
$E(\varepsilon \mid X) = 0$.


Model (18.10) describes the situation where the response variable $Y$ depends on the $p$-dimensional
variable $X$ only through a $k$-dimensional subspace. The unknown $\beta_i$’s, which
span this space, are called effective dimension reduction directions (EDR-directions). The
span is denoted as effective dimension reduction space (EDR-space). The aim is to estimate
the base vectors of this space, for which neither the length nor the direction can be identified.
Only the space in which they lie is identifiable.


SIR tries to find this $k$-dimensional subspace of $\mathbb{R}^p$ which under the model (18.10) carries
the essential information of the regression between $X$ and $Y$. SIR also focuses on small $k$,
so that nonparametric methods can be applied for the estimation of $m$. A direct application
of nonparametric smoothing to $X$ is generally not possible for high dimension $p$ due to the
sparseness of the observations. This fact is well known as the curse of dimensionality, see
Huber (1985).


The name of SIR comes from computing the inverse regression (IR) curve. That means 
instead of looking for E(Y |X = x), we investigate E(X |Y = y), a curve in R? consisting of 
p one-dimensional regressions. What is the connection between the IR and the SIR model 
(18.10)? The answer is given in the following theorem from Li (1991). 


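THEOREM 18.1 Given the model (18.10) and the assumption

$\forall b \in \mathbb{R}^p: \quad E(b^{\top}X \mid \beta_1^{\top}X = \beta_1^{\top}x, \ldots, \beta_k^{\top}X = \beta_k^{\top}x) = c_0 + \sum_{i=1}^{k} c_i \beta_i^{\top}x,$ (18.11)

the centered IR curve $E(X \mid Y = y) - E(X)$ lies in the linear subspace spanned by the vectors
$\Sigma\beta_i$, $i = 1, \ldots, k$, where $\Sigma = \mathrm{Cov}(X)$.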


Assumption (18.11) is equivalent to the fact that $X$ has an elliptically symmetric distribution,
see Cook and Weisberg (1991). Hall and Li (1993) have shown that assumption (18.11) only
needs to hold for the EDR-directions.

It is easy to see that for the standardized variable $Z = \Sigma^{-1/2}\{X - E(X)\}$ the IR curve
$m_1(y) = E(Z \mid Y = y)$ lies in $\mathrm{span}(\eta_1, \ldots, \eta_k)$, where $\eta_i = \Sigma^{1/2}\beta_i$. This means that the
conditional expectation $m_1(y)$ is moving in $\mathrm{span}(\eta_1, \ldots, \eta_k)$ depending on $y$. With $b$ orthogonal
to $\mathrm{span}(\eta_1, \ldots, \eta_k)$, it follows that

$b^{\top} m_1(y) = 0,$

and further that

$m_1(y) m_1(y)^{\top} b = \mathrm{Cov}\{m_1(y)\} b = 0.$


As a consequence $\mathrm{Cov}\{E(Z \mid y)\}$ is degenerate in each direction orthogonal to all
EDR-directions $\eta_i$ of $Z$. This suggests the following algorithm.

First, estimate $\mathrm{Cov}\{m_1(y)\}$ and then calculate the orthogonal directions of this matrix (for
example, with eigenvalue/eigenvector decomposition). In general, the estimated covariance
matrix will have full rank because of random variability, estimation errors and numerical
imprecision. Therefore, we investigate the eigenvalues of the estimate and ignore eigenvectors
having small eigenvalues. The remaining eigenvectors $\hat\eta_i$ are estimates for the EDR-directions
$\eta_i$ of $Z$. We can easily rescale them to estimates $\hat\beta_i$ for the EDR-directions of $X$ by
multiplying by $\hat\Sigma^{-1/2}$, but then they are not necessarily orthogonal. SIR is strongly related
to PCA. If all of the data falls into a single interval, which means that $\widehat{\mathrm{Cov}}\{m_1(y)\}$ is equal
to $\widehat{\mathrm{Cov}}(Z)$, SIR coincides with PCA. Obviously, in this case any information about $y$ is ignored.


The SIR Algorithm 


The algorithm to estimate the EDR-directions via SIR is as follows: 


1. Standardize $x$:
$z_i = \hat\Sigma^{-1/2}(x_i - \bar x).$

2. Divide the range of $y_i$ into $S$ nonoverlapping intervals (slices) $H_s$, $s = 1, \ldots, S$. $n_s$
denotes the number of observations within slice $H_s$, and $I_{H_s}$ the indicator function for
this slice:
$n_s = \sum_{i=1}^{n} I_{H_s}(y_i).$

3. Compute the mean of $z_i$ within each slice. This is a crude estimate $\hat m_1$ for the inverse
regression curve $m_1$:
$\bar z_s = n_s^{-1} \sum_{i=1}^{n} z_i I_{H_s}(y_i).$

4. Calculate the estimate for $\mathrm{Cov}\{m_1(y)\}$:
$\hat V = n^{-1} \sum_{s=1}^{S} n_s \bar z_s \bar z_s^{\top}.$

5. Identify the eigenvalues $\hat\lambda_i$ and eigenvectors $\hat\eta_i$ of $\hat V$.

6. Transform the standardized EDR-directions $\hat\eta_i$ back to the original scale. Now the
estimates for the EDR-directions are given by
$\hat\beta_i = \hat\Sigma^{-1/2} \hat\eta_i.$


REMARK 18.1 The number of nonzero eigenvalues depends on the number of slices. The rank
of $\hat V$ cannot be greater than the number of slices minus one (the $z_i$ sum up to zero).
This is a problem for categorical response variables, especially for a binary response, where
only one direction can be found.
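
The six steps of the SIR algorithm can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the book's quantlet: the function names `inv_sqrt` and `sir` are hypothetical, the slices are formed as equal-count groups of the sorted responses rather than as fixed intervals, and the simulated single-index data are chosen only for the demonstration.

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix."""
    w, U = np.linalg.eigh(S)
    return U @ np.diag(w ** -0.5) @ U.T

def sir(X, y, n_slices=10):
    """Sketch of SIR: returns eigenvalues (descending) and EDR-direction estimates."""
    n, p = X.shape
    S_inv_sqrt = inv_sqrt(np.cov(X, rowvar=False, bias=True))
    Z = (X - X.mean(axis=0)) @ S_inv_sqrt              # step 1: standardize
    # step 2: slices with (almost) equal counts, an assumption of this sketch
    slices = np.array_split(np.argsort(y), n_slices)
    V = np.zeros((p, p))
    for idx in slices:
        z_bar = Z[idx].mean(axis=0)                    # step 3: slice mean
        V += len(idx) * np.outer(z_bar, z_bar)         # step 4: weighted outer product
    V /= n
    lam, eta = np.linalg.eigh(V)                       # step 5: eigh sorts ascending
    lam, eta = lam[::-1], eta[:, ::-1]
    beta = S_inv_sqrt @ eta                            # step 6: back to the scale of X
    return lam, beta

# Simulated single-index data (illustrative only)
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 3))
b = np.array([1.0, 1.0, 1.0])
u = X @ b
y = u + u ** 3 + rng.standard_normal(300)

lam, beta = sir(X, y, n_slices=15)                     # 20 observations per slice
cos1 = abs(beta[:, 0] @ b) / (np.linalg.norm(beta[:, 0]) * np.linalg.norm(b))
print(lam / lam.sum(), cos1)
```

Because the directions are identified only up to sign and length, the comparison with the true direction uses the absolute normalized inner product.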


SIR II


In the previous section we learned that it is interesting to consider the IR curve, that is, 
E(X | y). In some situations however SIR does not find the EDR-direction. We overcome 
this difficulty by considering the conditional covariance Cov(X | y) instead of the IR curve. 
An example where the EDR directions are not found via the SIR curve is given below. 


EXAMPLE 18.2 Suppose that $(X_1, X_2)^{\top} \sim N(0, \mathcal{I}_2)$ and $Y = X_1^2$. Then $E(X_2 \mid y) = 0$
because of independence and $E(X_1 \mid y) = 0$ because of symmetry. Hence, the EDR-direction
$\beta = (1, 0)^{\top}$ is not found when the IR curve $E(X \mid y) = 0$ is considered.

The conditional variance
$\mathrm{Var}(X_1 \mid Y = y) = E(X_1^2 \mid Y = y) = y$
offers an alternative way to find $\beta$. It is a function of $y$ while $\mathrm{Var}(X_2 \mid y)$ is a constant.
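
The effect described in Example 18.2 can be checked by simulation. The following sketch, with an assumed sample size, seed and number of equal-count slices, shows that the slice means of both coordinates stay near zero while the slice variance of $X_1$ grows with $y$ and that of $X_2$ stays near one.

```python
import numpy as np

rng = np.random.default_rng(1)
n, S = 20000, 10                        # sample size and number of slices (assumed)
X = rng.standard_normal((n, 2))         # (X1, X2) ~ N(0, I_2)
y = X[:, 0] ** 2                        # Y = X1^2

# Equal-count slices of the sorted response
slices = np.array_split(np.argsort(y), S)
slice_means = np.array([X[idx].mean(axis=0) for idx in slices])
slice_vars = np.array([X[idx].var(axis=0) for idx in slices])

print(np.round(slice_means, 3))         # both columns near 0: the IR curve is flat
print(np.round(slice_vars, 3))          # column 0 increases with y, column 1 stays near 1
```

The first column of `slice_vars` roughly tracks the average of $y$ within each slice, which is the sample counterpart of $\mathrm{Var}(X_1 \mid Y = y) = y$.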


The idea of SIR II is to consider the conditional covariances. The principle of SIR II is
the same as before: investigation of the IR curve (here the conditional covariance instead of
the conditional expectation). Unfortunately, the theory of SIR II is more complicated. The
assumption of an elliptically symmetric distribution of $X$ has to be strengthened, i.e., $X$ is
assumed to be normal.

Given this assumption, one can show that the vectors with the largest distance to
$\mathrm{Cov}(Z \mid Y = y) - E\{\mathrm{Cov}(Z \mid Y = y)\}$ for all $y$ are the most interesting for the EDR-space. An
appropriate measure for the overall mean distance is, according to Li (1992),

$E(\|[\mathrm{Cov}(Z \mid Y = y) - E\{\mathrm{Cov}(Z \mid Y = y)\}] b\|^2)$ (18.12)
$= b^{\top} E(\|\mathrm{Cov}(Z \mid y) - E\{\mathrm{Cov}(Z \mid y)\}\|^2) b.$ (18.13)

Equipped with this distance, we conduct again an eigensystem decomposition, this time
for the above expectation $E(\|\mathrm{Cov}(Z \mid y) - E\{\mathrm{Cov}(Z \mid y)\}\|^2)$. Then we take the rescaled
eigenvectors with the largest eigenvalues as estimates for the unknown EDR-directions.


The SIR II Algorithm 


The algorithm of SIR II is very similar to the one for SIR; it differs in only two steps.
Instead of merely computing the mean, the covariance of each slice has to be computed. The
estimate for the above expectation (18.12) is calculated after computing all slice covariances.
Finally, decomposition and rescaling are conducted, as before.


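1. Do steps 1 to 3 of the SIR algorithm.

2. Compute the slice covariance matrix $\hat V_s$, where $\bar z_s$ denotes the slice mean from step 3
of the SIR algorithm:
$\hat V_s = (n_s - 1)^{-1} \left\{ \sum_{i=1}^{n} I_{H_s}(y_i) z_i z_i^{\top} - n_s \bar z_s \bar z_s^{\top} \right\}.$

3. Calculate the mean over all slice covariances:
$\bar V = n^{-1} \sum_{s=1}^{S} n_s \hat V_s.$

4. Compute an estimate for (18.12):
$\hat V = n^{-1} \sum_{s=1}^{S} n_s (\hat V_s - \bar V)^2 = n^{-1} \sum_{s=1}^{S} n_s \hat V_s^2 - \bar V^2.$

5. Identify the eigenvectors and eigenvalues of $\hat V$ and scale back the eigenvectors. This
gives estimates for the SIR II EDR-directions:
$\hat\beta_i = \hat\Sigma^{-1/2} \hat\eta_i.$

Figure 18.4. Plot of the true response versus the true indices (panel title: True index vs
Response). The monotonic and the convex shapes can be clearly seen. MVAsirdata.xpl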
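
The SIR II steps listed above can be sketched in the same way as SIR. Again this is only an illustration under assumptions: the function name `sir2` is hypothetical, slices are equal-count groups of the sorted responses, and the test data follow the setting of Example 18.2, where the IR curve carries no information.

```python
import numpy as np

def sir2(X, y, n_slices=10):
    """Sketch of SIR II: eigen-decomposition of the spread of the slice covariances."""
    n, p = X.shape
    w, U = np.linalg.eigh(np.cov(X, rowvar=False, bias=True))
    S_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T
    Z = (X - X.mean(axis=0)) @ S_inv_sqrt                 # standardize as in SIR
    slices = np.array_split(np.argsort(y), n_slices)      # equal-count slices (assumed)
    sizes = np.array([len(idx) for idx in slices])
    # slice covariances with the (n_s - 1)^{-1} normalization
    V_s = np.array([np.cov(Z[idx], rowvar=False) for idx in slices])
    V_bar = np.tensordot(sizes, V_s, axes=1) / n          # weighted mean of slice covariances
    V = sum(n_s * (Vs - V_bar) @ (Vs - V_bar)
            for n_s, Vs in zip(sizes, V_s)) / n           # estimate of the expectation (18.12)
    lam, eta = np.linalg.eigh(V)
    lam, eta = lam[::-1], eta[:, ::-1]                    # largest eigenvalue first
    return lam, S_inv_sqrt @ eta                          # rescaled directions

rng = np.random.default_rng(2)
X = rng.standard_normal((2000, 2))
y = X[:, 0] ** 2                                          # setting of Example 18.2
lam2, beta2 = sir2(X, y)
cos_first = abs(beta2[0, 0]) / np.linalg.norm(beta2[:, 0])
print(lam2 / lam2.sum(), cos_first)
```

In this setting the first SIR II direction should be close to $(1, 0)^{\top}$ up to sign, which is the direction that the slice means alone cannot detect.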


EXAMPLE 18.3 The result of SIR is visualized in four plots in Figure 18.6: the left two
show the response variable versus the first and the second direction, respectively. The upper
right plot consists of a three-dimensional plot of the first two directions and the response. The
last picture shows $\Psi_k$, the ratio of the sum of the first $k$ eigenvalues and the sum of all
eigenvalues, similar to principal component analysis.


The data are generated according to the following model:

$y_i = \beta_1^{\top}x_i + (\beta_1^{\top}x_i)^3 + 4(\beta_2^{\top}x_i)^2 + \varepsilon_i,$

where the $x_i$'s follow a three-dimensional normal distribution with zero mean, the covariance
equal to the identity matrix, $\beta_2 = (1, -1, -1)^{\top}$, and $\beta_1 = (1, 1, 1)^{\top}$. $\varepsilon_i$ is standard normally
distributed and $n = 300$. Corresponding to model (18.10), $m(u, v, \varepsilon) = u + u^3 + 4v^2 + \varepsilon$. The
situation is depicted in Figure 18.4 and Figure 18.5.
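
For readers who want to reproduce a data set of this kind, the model above can be simulated as follows. The seed and variable names are assumptions of this sketch, so the resulting numbers will not match the tables of the example exactly.

```python
import numpy as np

rng = np.random.default_rng(42)           # arbitrary seed (assumption)
n = 300
beta1 = np.array([1.0, 1.0, 1.0])
beta2 = np.array([1.0, -1.0, -1.0])

X = rng.standard_normal((n, 3))           # x_i ~ N_3(0, I_3)
eps = rng.standard_normal(n)              # standard normal noise
u = X @ beta1                             # first true index
v = X @ beta2                             # second true index
y = u + u ** 3 + 4 * v ** 2 + eps         # response: monotone in u, convex in v

print(X.shape, y.shape)
```

Plotting `y` against `u` and against `v` gives the monotonic and the convex shape referred to in the figure captions.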


Both algorithms were conducted using the slicing method with 20 elements in each slice. The
goal was to find $\beta_1$ and $\beta_2$ with SIR. The data are designed such that SIR can detect $\beta_1$
because of the monotonic shape of $\{\beta_1^{\top}x_i + (\beta_1^{\top}x_i)^3\}$, while SIR II will search for $\beta_2$, as in
this direction the conditional variance on $y$ is varying.

Figure 18.5. Plot of the true response versus the true indices (panel title: True index vs
Response). The monotonic and the convex shapes can be clearly seen. MVAsirdata.xpl

$\hat\beta_1$     $\hat\beta_2$     $\hat\beta_3$
0.578   -0.723   -0.266
0.586    0.201    0.809
0.568    0.661   -0.524

Table 18.5. SIR: EDR-directions for simulated data.


If we normalize the eigenvalues for the EDR-directions in Table 18.5 such that they sum
up to one, the resulting vector is (0.852, 0.086, 0.062). As can be seen in the upper left
plot of Figure 18.6, there is a functional relationship found between the first index $\hat\beta_1^{\top}x$ and
the response. Actually, $\hat\beta_1$ and $\beta_1$ are nearly parallel, that is, the normalized inner product
$\hat\beta_1^{\top}\beta_1 / \{\|\hat\beta_1\| \|\beta_1\|\} = 0.9894$ is very close to one.

Figure 18.6. SIR: The left plots show the response versus the estimated
EDR-directions (XBeta1 vs Response, XBeta2 vs Response). The upper right plot is a
three-dimensional plot of the first two directions and the response. The lower right plot
shows the eigenvalues $\hat\lambda_i$ (*) and the cumulative sum (o). MVAsirdata.xpl


The second direction along $\beta_2$ is probably found due to the good approximation, but SIR does
not provide it clearly, because it is “blind” with respect to the change of variance, as the
second eigenvalue indicates.


For SIR II, the normalized eigenvalues are (0.706, 0.185, 0.108), that is, about 71% of the
variance is explained by the first EDR-direction (Table 18.6). Here, the normalized inner
product of $\hat\beta_1$ and $\beta_2$ is 0.9992. The estimator $\hat\beta_1$ estimates in fact $\beta_2$ of the simulated model.
In this case, SIR II found the direction where the second moment varies with respect to $\beta_2^{\top}x$.


In summary, SIR has found the direction which shows a strong relation regarding the
conditional expectation between $\beta_1^{\top}x$ and $y$, and SIR II has found the direction where the
conditional variance is varying, namely, $\beta_2^{\top}x$.

Figure 18.7. SIR II mainly sees the direction $\beta_2$. The left plots show the
response versus the estimated EDR-directions. The upper right plot is a
three-dimensional plot of the first two directions and the response. The
lower right plot shows the eigenvalues $\hat\lambda_i$ (*) and the cumulative sum (o).
MVAsir2data.xpl

$\hat\beta_1$     $\hat\beta_2$     $\hat\beta_3$
 0.821    0.180    0.446
-0.442   -0.826    0.370
-0.361   -0.534    0.815

Table 18.6. SIR II: EDR-directions for simulated data.


The behavior of the two SIR algorithms is as expected. In addition, we have seen that it is
worthwhile to apply both versions of SIR. It is possible to combine SIR and SIR II (Cook
and Weisberg, 1991; Li, 1991; Schott, 1994) directly, or to investigate higher conditional
moments. For the latter it seems to be difficult to obtain theoretical results. For further
details on SIR see Kötter (1996).


Summary

↪ SIR serves as a dimension reduction tool for regression problems.

↪ Inverse regression avoids the curse of dimensionality.

↪ The dimension reduction can be conducted without estimation of the regression
function $y = m(x)$.

↪ SIR searches for the effective dimension reduction (EDR) by computing
the inverse regression IR.

↪ SIR II bases the EDR on computing the inverse conditional variance.

↪ SIR might miss EDR directions that are found by SIR II.


18.4 Boston Housing 


Coming back to the Boston housing data set, we compare the results of exploratory projection
pursuit on the original data $\mathcal{X}$ and the transformed data $\widetilde{\mathcal{X}}$ motivated in Section 1.8. We
exclude $X_4$ (indicator of Charles River) from the present analysis.

The aim of this analysis is to see from a different angle whether our proposed transformations
yield more normal distributions and whether they yield data with fewer outliers. Both effects
will be visible in our projection pursuit analysis.


We first apply the Jones and Sibson index to the non-transformed data with 50 randomly
chosen 13-dimensional directions. Figure 18.8 displays the results in the following form. In
the lower part, we see the values of the Jones and Sibson index. They should be constant for
13-dimensional normal data. We observe that this is clearly not the case. In the upper
part of Figure 18.8 we show the standard normal density as a green curve and two densities
corresponding to two extreme index values. The red, slim curve corresponds to the maximal
value of the index among the 50 projections. The blue curve, which is close to the normal,
corresponds to the minimal value of the Jones and Sibson index. The corresponding values
of the indices have the same color in the lower part of Figure 18.8. Below the densities, a
jitter plot shows the distribution of the projected points $\alpha^{\top}x_i$ ($i = 1, \ldots, 506$). We conclude
from the outlying projection in the red distribution that several points are in conflict with
the normality assumption.

Figure 18.8. Projection Pursuit with the Sibson-Jones index with 13 original
variables. MVAppsib.xpl


Figure 18.9 presents an analysis with the same design for the transformed data. We observe
in the lower part of the figure values that are much lower for the Jones and Sibson index
(by a factor of 10) with lower variability, which suggests that the transformed data are closer
to the normal. (“Closeness” is interpreted here in the sense of the Jones and Sibson index.)
This is confirmed by looking at the upper part of Figure 18.9, which has a significantly less
outlying structure than Figure 18.8.


18.5 Exercises 


EXERCISE 18.1 Calculate the Simplicial Depth for the Swiss bank notes data set and compare
the results to the univariate medians. Calculate the Simplicial Depth again for the
genuine and counterfeit bank notes separately.

Figure 18.9. Projection Pursuit with the Sibson-Jones index with 13 transformed
variables. MVAppsib.xpl

EXERCISE 18.2 Construct a configuration of points in $\mathbb{R}^2$ such that $x_{med,j}$ from (18.2) is
not in the “center” of the scatterplot.


EXERCISE 18.3 Apply the SIR technique to the U.S. companies data with Y = market value
and X = all other variables. Which directions do you find?


EXERCISE 18.4 Simulate a data set with $X \sim N_4(0, \mathcal{I}_4)$, $Y = (X_1 + 3X_2)^2 + (X_3 - X_4)^4 + \varepsilon$
and $\varepsilon \sim N(0, (0.1)^2)$. Use SIR and SIR II to find the EDR directions.


EXERCISE 18.5 Apply the Projection Pursuit technique on the Swiss bank notes data set
and compare the results to the PC analysis and the Fisher discriminant rule.


EXERCISE 18.6 Apply the SIR and SIR II technique on the car data set in Table B.3 with 
Y = price. 


A Symbols and Notation 


Basics 


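$X, Y$                              random variables or vectors
$X_1, X_2, \ldots, X_p$             random variables
$X = (X_1, \ldots, X_p)^{\top}$     random vector
$X \sim \cdot$                      $X$ has distribution $\cdot$
$\mathcal{A}, \mathcal{B}$          matrices
$\Gamma, \Delta$                    matrices
$\mathcal{X}, \mathcal{Y}$          data matrices
$\Sigma$                            covariance matrix
$1_n$                               vector of ones $(1, \ldots, 1)^{\top}$ ($n$-times)
$0_n$                               vector of zeros $(0, \ldots, 0)^{\top}$ ($n$-times)
$I(\cdot)$                          indicator function, i.e. for a set $M$ is $I = 1$ on $M$, $I = 0$ otherwise
$i$                                 $\sqrt{-1}$
$\Rightarrow$                       implication
$\Leftrightarrow$                   equivalence
$\approx$                           approximately equal
$\otimes$                           Kronecker product
iff                                 if and only if, equivalence

Samples

$x, y$                              observations of $X$ and $Y$
$x_1, \ldots, x_n = \{x_i\}_{i=1}^{n}$   sample of $n$ observations of $X$
$\mathcal{X} = \{x_{ij}\}_{i=1,\ldots,n; j=1,\ldots,p}$   ($n \times p$) data matrix of observations of $X_1, \ldots, X_p$
                                    or of $X = (X_1, \ldots, X_p)^{\top}$
$x_{(1)}, \ldots, x_{(n)}$          the order statistic of $x_1, \ldots, x_n$
$\mathcal{H}$                       centering matrix, $\mathcal{H} = \mathcal{I}_n - n^{-1} 1_n 1_n^{\top}$

Characteristics of Distribution

$f(x)$                              density of $X$
$f(x, y)$                           joint density of $X$ and $Y$
$f_X(x), f_Y(y)$                    marginal densities of $X$ and $Y$
$f_{X_1}(x_1), \ldots, f_{X_p}(x_p)$   marginal densities of $X_1, \ldots, X_p$
$\hat f_h(x)$                       histogram or kernel estimator of $f(x)$
$F(x)$                              distribution function of $X$
$F(x, y)$                           joint distribution function of $X$ and $Y$
$F_X(x), F_Y(y)$                    marginal distribution functions of $X$ and $Y$
$F_{X_1}(x_1), \ldots, F_{X_p}(x_p)$   marginal distribution functions of $X_1, \ldots, X_p$
$\varphi(x)$                        density of the standard normal distribution
$\Phi(x)$                           standard normal distribution function
$\varphi_X(t)$                      characteristic function of $X$
$m_k$                               $k$-th moment of $X$
$\kappa_j$                          cumulants or semi-invariants of $X$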


Moments

$EX, EY$                            mean values of random variables or vectors $X$ and $Y$
$\sigma_{XY} = \mathrm{Cov}(X, Y)$  covariance between random variables $X$ and $Y$
$\sigma_{XX} = \mathrm{Var}(X)$     variance of random variable $X$
$\rho_{XY} = \mathrm{Cov}(X, Y) / \sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}$   correlation between random variables $X$ and $Y$
$\Sigma_{XY} = \mathrm{Cov}(X, Y)$  covariance between random vectors $X$ and $Y$,
                                    i.e., $\mathrm{Cov}(X, Y) = E(X - EX)(Y - EY)^{\top}$
$\Sigma_{XX} = \mathrm{Var}(X)$     covariance matrix of the random vector $X$

Empirical Moments

$\bar x = n^{-1} \sum_{i=1}^{n} x_i$   average of $X$ sampled by $\{x_i\}_{i=1,\ldots,n}$
$s_{XY} = n^{-1} \sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)$   empirical covariance of random variables $X$ and $Y$
                                    sampled by $\{x_i\}_{i=1,\ldots,n}$ and $\{y_i\}_{i=1,\ldots,n}$
$s_{XX} = n^{-1} \sum_{i=1}^{n} (x_i - \bar x)^2$   empirical variance of random variable $X$ sampled
                                    by $\{x_i\}_{i=1,\ldots,n}$
$r_{XY} = s_{XY} / \sqrt{s_{XX} s_{YY}}$   empirical correlation of $X$ and $Y$
$\mathcal{S} = \{s_{X_i X_j}\} = n^{-1} \mathcal{X}^{\top} \mathcal{H} \mathcal{X}$   empirical covariance matrix of $X_1, \ldots, X_p$ or of
                                    the random vector $X = (X_1, \ldots, X_p)^{\top}$
$\mathcal{R} = \{r_{X_i X_j}\} = \mathcal{D}^{-1/2} \mathcal{S} \mathcal{D}^{-1/2}$   empirical correlation matrix of $X_1, \ldots, X_p$ or of
                                    the random vector $X = (X_1, \ldots, X_p)^{\top}$

Distributions

$\varphi(x)$                        density of the standard normal distribution
$\Phi(x)$                           distribution function of the standard normal distribution
$N(0, 1)$                           standard normal or Gaussian distribution
$N(\mu, \sigma^2)$                  normal distribution with mean $\mu$ and variance $\sigma^2$
$N_p(\mu, \Sigma)$                  $p$-dimensional normal distribution with mean $\mu$
                                    and covariance matrix $\Sigma$
$\stackrel{\mathcal{L}}{\longrightarrow}$   convergence in distribution
CLT                                 Central Limit Theorem
$\chi^2_p$                          $\chi^2$ distribution with $p$ degrees of freedom
$\chi^2_{1-\alpha; p}$              $1 - \alpha$ quantile of the $\chi^2$ distribution with $p$ degrees of freedom
$t_n$                               t-distribution with $n$ degrees of freedom
$t_{1-\alpha/2; n}$                 $1 - \alpha/2$ quantile of the t-distribution with $n$ d.f.
$F_{n,m}$                           F-distribution with $n$ and $m$ degrees of freedom
$F_{1-\alpha; n,m}$                 $1 - \alpha$ quantile of the F-distribution with $n$ and
                                    $m$ degrees of freedom


Mathematical Abbreviations 


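$\mathrm{tr}(\mathcal{A})$          trace of matrix $\mathcal{A}$
$\mathrm{hull}(x_1, \ldots, x_k)$   convex hull of points $\{x_1, \ldots, x_k\}$
$\mathrm{diag}(\mathcal{A})$        diagonal of matrix $\mathcal{A}$
$\mathrm{rank}(\mathcal{A})$        rank of matrix $\mathcal{A}$
$\det(\mathcal{A})$                 determinant of matrix $\mathcal{A}$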


B Data 


All data sets are available on the MD*base webpage at www.mdtech.de. More detailed 
information on the data sets may be found there. 


B.1 Boston Housing Data 


The Boston housing data set was collected by Harrison and Rubinfeld (1978). It comprises
506 observations, one for each census district of the Boston metropolitan area. The data set
was analyzed in Belsley, Kuh and Welsch (1980).

$X_1$: per capita crime rate,
$X_2$: proportion of residential land zoned for large lots,
$X_3$: proportion of nonretail business acres,
$X_4$: Charles River (1 if tract bounds river, 0 otherwise),
$X_5$: nitric oxides concentration,
$X_6$: average number of rooms per dwelling,
$X_7$: proportion of owner-occupied units built prior to 1940,
$X_8$: weighted distances to five Boston employment centers,
$X_9$: index of accessibility to radial highways,
$X_{10}$: full-value property tax rate per $10,000,
$X_{11}$: pupil/teacher ratio,
$X_{12}$: $1000(B - 0.63)^2 I(B < 0.63)$ where $B$ is the proportion of blacks,
$X_{13}$: % lower status of the population,
$X_{14}$: median value of owner-occupied homes in $1000.

B.2 Swiss Bank Notes

Six variables measured on 100 genuine and 100 counterfeit old Swiss 1000-franc bank notes.
The data stem from Flury and Riedwyl (1988). The columns correspond to the following 6
variables.

$X_1$: Length of the bank note,
$X_2$: Height of the bank note, measured on the left,
$X_3$: Height of the bank note, measured on the right,
$X_4$: Distance of inner frame to the lower border,
$X_5$: Distance of inner frame to the upper border,
$X_6$: Length of the diagonal.

Observations 1-100 are the genuine bank notes and the other 100 observations are the
counterfeit bank notes.


Length Height Height Inner Frame’ Inner Frame Diagonal 


(left) (right) (lower) (upper) 
214.8 131.0 131.1 9.0 9.7 141.0 
214.6 129.7 129.7 8.1 9.5 141.7 
214.8 129.7 129.7 8.7 9.6 142.2 
214.8 129.7 129.6 7.5 10.4 142.0 
215.0 129.6 129.7 10.4 7.7 141.8 
215.7 130.8 130.5 9.0 10.1 141.4 
215.5 129.5 129.7 7.9 9.6 141.6 
214.5 129.6 129.2 7.2 0.7 141.7 
214.9 129.4 129.7 8.2 11.0 141.9 
215.2 130.4 130.3 9.2 10.0 140.7 
215.3 130.4 130.3 7.9 11.7 141.8 
215.1 129.5 129.6 Gal 10.5 142.2 
215.2 130.8 129.6 7.9 10.8 141.4 
214.7 129.7 129.7 7.7 0.9 141.7 
215.1 129.9 129.7 7.7 0.8 141.8 
214.5 129.8 129.8 9.3 8.5 141.6 
214.6 129.9 130.1 8.2 9.8 141.7 
215.0 129.9 129.7 9.0 9.0 141.9 
215.2 129.6 129.6 7.4 11.5 141.5 
214.7 130.2 129.9 8.6 10.0 141.9 
215.0 129.9 129.3 8.4 0.0 141.4 
215.6 130.5 130.0 8.1 0.3 141.6 
215.3 130.6 130.0 8.4 10.8 141.5 
215.7 130.2 130.0 8.7 10.0 141.6 
215.1 129.7 129.9 7A 10.8 141.1 
215.3 130.4 130.4 8.0 11.0 142.3 
215.5 130.2 130.1 8.9 9.8 142.4 
215.1 130.3 130.3 9.8 9.5 141.9 
215.1 130.0 130.0 7.4 0.5 141.8 
214.8 129.7 129.3 8.3 9.0 142.0 
215.2 130.1 129.8 7.9 10.7 141.8 
214.8 129.7 129.7 8.6 9.1 142.3 
215.0 130.0 129.6 Tl 10.5 140.7 
215.6 130.4 130.1 8.4 10.3 141.0 
215.9 130.4 130.0 8.9 10.6 141.4 
214.6 130.2 130.2 9.4 9.7 141.8 
215.5 130.3 130.0 8.4 9.7 141.8 
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215.3 
215.3 
213.9 
214.4 
214.8 
214.9 
214.9 
214.8 
214.3 
214.8 
214.8 
214.6 
214.5 
214.6 
215.3 
214.5 
215.4 
214.5 
215.2 
215.7 
215.0 
215.1 
215.1 
215.1 
215.3 
215.4 
214.5 
215.0 
215.2 
214.6 
214.8 
215.1 
214.9 
213.8 
215.2 
215.0 
214.4 
215.2 
214.1 
214.9 
214.6 
215.2 
214.6 
215.1 
214.9 
215.2 
215.2 
215.4 
215.1 
215.2 
215.0 
214.9 
215.0 
214.7 
215.4 
214.9 
214.5 
214.7 
215.6 
215.0 
214.4 
215.1 
214.7 
214.4 


129.9 
130.3 
130.3 
129.8 
130.1 
129.6 
130.4 
129.4 
129.5 
129.9 
129.9 
129.7 
129.0 
129.8 
130.6 
130.1 
130.2 
129.4 
129.7 
130.0 
129.6 
130.1 
130.0 
129.6 
129.7 
129.8 
130.0 
130.0 
130.6 
129.5 
129.7 
129.6 
130.2 
129.8 
129.9 
129.6 
129.9 
129.9 
129.6 
129.9 
129.8 
130.5 
129.9 
129.7 
129.8 
129.7 
130.1 
130.7 
129.9 
129.9 
129.6 
130.3 
129.9 
129.7 
130.0 
129.4 
129.5 
129.6 
129.9 
130.4 
129.7 
130.0 
130.0 
130.1 


129.4 
130.1 
129.0 
129.2 
129.6 
129.4 
129.7 
129.1 
129.4 
129.7 
129.7 
129.8 
129.6 
129.4 
130.0 
130.0 
130.2 
129.5 
129.4 
129.4 
129.4 
129.9 
129.8 
129.3 
129.4 
129.4 
129.5 
129.8 
130.0 
129.2 
129.3 
129.8 
130.2 
129.5 
129.5 
130.2 
129.6 
129.7 
129.3 
130.1 
129.4 
129.8 
129.4 
129.7 
129.6 
129.1 
129.9 
130.2 
129.6 
129.7 
129.2 
129.9 
129.7 
129.3 
129.9 
129.5 
129.3 
129.5 
129.9 
130.3 
129.5 
129.8 
129.4 
130.3 


Peer i 

aia ee PNORAHWHEWOHOUDMOSO: 
F OFrPORR PNR RPrROOR RRP OR RFP RP RRR oa 

Eo RS Ss a OR eas eh Ss tea OB ic Oa Ue Gr ae SR ra 


a 
NS 

ay 

e 


a 
NS 

Peek 

bw oh 


141. 
139. 




214.9 130.5 130.2 11.0 11.5 139.5 
214.9 130.3 130.1 8.7 11.7 140.2 
215.0 130.4 130.6 9.9 10.9 140.3 
214.7 130.2 130.3 11.8 10.9 139.7 
215.0 130.2 130.2 10.6 10.7 139.9 
215.3 130.3 130.1 9.3 12.1 140.2 
214.8 130.1 130.4 9.8 11.5 139.9 
215.0 130.2 129.9 10.0 11.9 139.4 
215.2 130.6 130.8 10.4 11.2 140.3 
215.2 130.4 130.3 8.0 11.5 139.2 
215.1 130.5 130.3 10.6 11.5 140.1 
215.4 130.7 131.1 9.7 11.8 140.6 
214.9 130.4 129.9 11.4 11.0 139.9 
215.1 130.3 130.0 10.6 10.8 139.7 
215.5 130.4 130.0 8.2 11.2 139.2 
214.7 130.6 130.1 11.8 10.5 139.8 
214.7 130.4 130.1 12.1 10.4 139.9 
214.8 130.5 130.2 11.0 11.0 140.0 
214.4 130.2 129.9 10.1 12.0 139.2 
214.8 130.3 130.4 10.1 12.1 139.6 
215.1 130.6 130.3 12.3 10.2 139.6 
215.3 130.8 131.1 11.6 10.6 140.2 
215.1 130.7 130.4 10.5 11.2 139.7 
214.7 130.5 130.5 9.9 10.3 140.1 
214.9 130.0 130.3 10.2 11.4 139.6 
215.0 130.4 130.4 9.4 11.6 140.2 
215.5 130.7 130.3 10.2 11.8 140.0 
215.1 130.2 130.2 10.1 11.3 140.3 
214.5 130.2 130.6 9.8 12.1 139.9 
214.3 130.2 130.0 10.7 10.5 139.8 
214.5 130.2 129.8 12.3 11.2 139.2 
214.9 130.5 130.2 10.6 11.5 139.9 
214.6 130.2 130.4 10.5 11.8 139.7 
214.2 130.0 130.2 11.0 11.2 139.5 
214.8 130.1 130.1 11.9 11.1 139.5 
214.6 129.8 130.2 10.7 11.1 139.4 
214.9 130.7 130.3 9.3 11.2 138.3 
214.6 130.4 130.4 11.3 0.8 139.8 
214.5 130.5 130.2 11.8 10.2 139.6 
214.8 130.2 130.3 10.0 11.9 139.3 
214.7 130.0 129.4 10.2 11.0 139.2 
214.6 130.2 130.4 11.2 10.7 139.9 
215.0 130.5 130.4 10.6 11.1 139.9 
214.5 129.8 129.8 11.4 10.0 139.3 
214.9 130.6 130.4 11.9 10.5 139.8 
215.0 130.5 130.4 11.4 10.7 139.9 
215.3 130.6 130.3 9.3 11.3 138.1 
214.7 130.2 130.1 10.7 11.0 139.4 
214.9 129.9 130.0 9.9 12.3 139.4 
214.9 130.3 129.9 11.9 10.6 139.8 
214.6 129.9 129.7 11.9 10.1 139.0 
214.6 129.7 129.3 10.4 11.0 139.3 
214.5 130.1 130.1 12.1 10.3 139.4 
214.5 130.3 130.0 11.0 11.5 139.5 
215.1 130.0 130.3 11.6 10.5 139.7 
214.2 129.7 129.6 10.3 11.4 139.5 
214.4 130.1 130.0 11.3 10.7 139.2 
214.8 130.4 130.6 12.5 10.0 139.3 
214.6 130.6 130.1 8.1 12.1 137.9 
215.6 130.1 129.7 7A 12.2 138.4 
214.9 130.5 130.1 9.9 10.2 138.1 
214.6 130.1 130.0 11.5 10.6 139.5 
214.7 130.1 130.2 11.6 10.9 139.1 
214.3 130.3 130.0 11.4 10.5 139.8 
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215.1 
216.3 
215.6 
214.8 
214.9 
213.9 
214.2 
214.8 
214.8 
214.8 
214.9 
214.3 
214.5 
214.8 
214.5 
215.0 
214.8 
215.0 
214.6 
214.7 
214.7 
214.5 
214.8 
214.8 
214.6 
215.1 
215.4 
214.7 
215.0 
214.9 
215.0 
215.1 
214.8 
214.7 
214.3 


130.3 
130.7 
130.4 
129.9 
130.0 
130.7 
130.6 
130.5 
129.6 
130.1 
130.4 
130.1 
130.4 
130.5 
130.2 
130.4 
130.6 
130.5 
130.5 
130.2 
130.4 
130.4 
130.0 
129.9 
130.3 
130.2 
130.5 
130.3 
130.5 
130.3 
130.4 
130.3 
130.3 
130.7 
129.9 


130.6 
130.4 
130.1 
129.8 
129.9 
130.5 
130.4 
130.3 
130.0 
130.0 
130.2 
130.1 
130.0 
130.3 
130.4 
130. 
130.6 
130.1 
130.4 
130. 
130.0 
130.0 
129.7 
130.2 
130.2 
129.8 
130.6 
130.2 
130.3 
130.5 
130.3 
129.9 
130.4 
130.8 
129.9 
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12.0 
10.1 
11.2 
12.0 
10.9 
11.5 
10.2 
10.5 
11.6 
10.5 
10.7 
10.5 
12.0 
12.1 
11.8 
10.7 
11.4 
11.4 
11.4 
L11 
10.7 
12.2 
10.6 
11.9 

9.1 
12.0 
11.0 
11.1 
11.0 
10.6 
12.1 
11.5 
11.1 
11.2 
11.5 


139.7 
138.8 
138.6 
139.6 
139.7 
137.8 
139.6 
139.4 
139.2 
139.6 
139.0 
139.7 
139.6 
139.1 
137.8 
139.1 
138.7 
139.3 
139.3 
139.5 
139.4 
138.5 
139.2 
139.4 
139.2 
139.4 
138.6 
139.2 
138.5 
139.8 
139.6 
139.7 
140.0 
139.4 
139.6 
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B.3 Car Data 


The car data set (Chambers, Cleveland, Kleiner and Tukey, 1983) consists of 13 variables
measured for 74 car types. The abbreviations in Table B.3 are as follows:

P    Price,
M    Mileage (in miles per gallon),
R78  Repair record 1978 (rated on a 5-point scale; 5 best, 1 worst),
R77  Repair record 1977 (scale as before),
H    Headroom (in inches),
R    Rear seat clearance (distance from front seat back to rear seat, in inches),
Tr   Trunk space (in cubic feet),
W    Weight (in pounds),
L    Length (in inches),
T    Turning diameter (clearance required to make a U-turn, in feet),
D    Displacement (in cubic inches),
G    Gear ratio for high gear,
C    Company headquarters (1 for U.S., 2 for Japan, 3 for Europe).

Model P M R78 R77 H R Tr W L T D G C
AMC-Concord 4099 22 3 2 25 27.5 11 2930 186 40 121 3.58 1 
AMC-Pacer 4749 17 3 1 3.0 25.5 11 3350 173 40 258 2.53 1 
AMC-Spirit 3799 22 Z — 30 185 12 2640 168 35 121 3.08 1 
Audi-5000 9690 17 5 2 3.0 27.0 15 2830 189 37 131 3.20 1 
Audi-Fox 6295 23 3 3.25 28.0 11 2070 174 36 97 3.70 3 
BMW-320i 9735 25 4 4 25 26.0 12 2650 177 34 121 364 3 
Buick—Century 4816 20 3 3.45 29.0 16 3250 196 40 196 2.93 1 
Buick—Electra 7827 15 4 4 40 31.5 20 4080 222 43 350 241 1 
Buick—Le-Sabre 5788 18 3 4 40 30.5 21 3670 218 43 231 2.73 1 
Buick—-Opel 4453 26 = - 30 240 10 2230 170 34 304 287 1 
Buick-Regal 5189 20 3 3. 2.0 28.5 16 3280 200 42 196 2.93 1 
Buick-Riviera 10372 16 3 4 3.5 30.0 17 3880 207 43 231 2.93 1 
Buick-Skylark 4082 19 3 3. 3.5 27.0 13 3400 200 42 231 3.08 1 
Cad—Deville 11385 14 3 3. 40 31.5 20 4330 221 44 425 2.28 1 
Cad.Eldorado 14500 14 2 2 3.5 30.0 16 3900 204 43 350 219 1 
Cad.—Seville 1590621 3 3. 3.0 30.0 13 4290 204 45 350 2.24 1 
Chev.—Chevette 3299 29 3 3.25 260 9 2110 163 34 231 2.93 1 
Chev.-Impala 5705 16 4 4 40 29.5 20 3690 212 43 250 2.56 1 
Chev.—Malibu 4504 22 3 3. 3.5 285 17 3180 193 41 200 273 1 
Chev.Monte-Carlo 5104.22 2 3. 2.0 285 16 3220 200 41 200 273 1 
Chev.—Monza 3667 24 2 2 20 25.0 7 2750 179 40 151 273 1 
Chev.-Nova 3955 19 3 3. 3.5 27.0 13 3430 197 43 250 2.56 1 
Datsun-200-SX 6229 23 4 3 15 21.0 6 2370 170 35 119 3.89 2 
Datsun—210 4589 35 5 5 20 23.5 8 2020 165 32 85 3.70 2 
Datsun-510 5079 24 4 4 25 220 8 2280 170 34 119 354 2 
Datsun-810 Bing SL 4 4 25 27.0 8 2750 184 38 146 3.55 2 
Dodge-Colt 3984 30 5 4 20 240 8 2120 163 35 98 354 2 
Dodge-Diplomat 5010 18 2 2 40 29.0 17 3600 206 46 318 247 1 
Dodge—-Magnum-XE 5886 16 2 2 3.5 260 16 3870 216 48 318 2.71 1 
Dodge-St.Regis 634217 2 2 45 28.0 21 3740 220 46 225 2.94 1 
Fiat—Strada 4296 21 3 1 25 265 16 2130 161 36 105 337 38 
Ford-Fiesta 4389 28 4 - 15 260 9 1800 147 33 98 315 1 
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Ford—Mustang 
Honda—Accord 
Honda-—Civic 
Linc.—Continental 
Linc.—Cont—Mark—V 
Linc.—Versailles 
Mazda—GLC 
Merc.—Bobcat 
Merc.—Cougar 
Merc.—Cougar—XR-7 
Merc.—Marquis 
Merc.—Monarch 
Merc.—Zephyr 
Olds.—98 
Olds.—Cutlass 
Olds.—Cutl-Supr 
Olds.—Delta—88 
Olds.-Omega 
Olds.—Starfire 
Olds.—Tornado 
Peugeot—604-SL 
Plym.—Arrow 
Plym.—Champ 
Plym.—Horizon 
Plym.—Sapporo 
Plym.—Volare 
Pont.—Catalina 
Pont.—Firebird 
Pont.—Grand-—Prix 
Pont.—Le—Mans 
Pont.—Phoenix 
Pont.—Sunbird 
Renault—Le—Car 
Subaru 
Toyota—Cecila 
Toyota—Corolla 
Toyota—Corona 
VW-Rabbit 
VW-Rabbit—Diesel 
VW-Scirocco 
VW-Dasher 
Volvo—260 


4187 
5799 
4499 
11497 
13594 
13466 
3995 
3829 
5379 
6303 
6165 
4516 
3291 
8814 
4733 
5172 
5890 
4181 
4195 
10371 
12990 
4647 
4425 
4482 
6486 
4060 
5798 
4934 
5222 
4723 
4424 
4172 
3895 
3798 
5899 
3748 
5719 
4697 
5397 
6850 
7140 
11995 
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23.0 
25.5 
23.5 
30.5 
28.5 
27.0 
25.5 
25.5 
29.5 
25.0 
30.5 
27.0 
29.0 
31.5 
28.0 
28.0 
29.0 
27.0 
25.5 
30.0 
30.5 
21.5 
23.0 
25.0 
22.0 
31.0 
29.0 
23.5 
28.5 
28.0 
27.0 
25.0 
23.0 
25.5 
22.0 
24.5 
23.0 
25.5 
25.5 
23.5 
37.5 
29.5 


2650 
2240 
1760 
4840 
4720 
3830 
1980 
2580 
4060 
4130 
3720 
3370 
2830 
4060 
3300 
3310 
3690 
3370 
2720 
4030 
3420 
2360 
1800 
2200 
2520 
3330 
3700 
3470 
3210 
3200 
3420 
2690 
1830 
2050 
2410 
2200 
2670 
1930 
2040 
1990 
2160 
3170 


179 
172 
149 
233 
230 
201 
154 
169 
221 
217 
212 
198 
195 
220 
198 
198 
218 
200 
180 
206 
192 
170 
157 
165 
182 
201 
214 
198 
201 
199 
203 
179 
142 
164 
174 
165 
175 
155 
155 
156 
172 
193 


3.08 
3.05 
3.30 
2.47 
2.47 
2.47 
3.73 
2.73 
2.75 
2.75 
2.26 
2.43 
3.08 
2.41 
2.93 
2.93 
2.73 
3.08 
2.73 
2.41 
3.58 
3.05 
2.97 
3.37 
3.54 
3.23 
2.73 
3.08 
2.93 
2.93 
3.08 
2.73 
3.72 
3.81 
3.06 
3.21 
3.05 
3.78 
3.78 
3.78 
3.74 
2.98 
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B.4 Classic Blue Pullovers Data 


This is a data set consisting of 10 measurements of 4 variables. The story: A textile shop 
manager is studying the sales of “classic blue” pullovers over 10 periods. He uses three 
different marketing methods and hopes to understand his sales as a fit of these variables 
using statistics. The variables measured are 


$X_1$: Numbers of sold pullovers,
$X_2$: Price (in EUR),
$X_3$: Advertisement costs in local newspapers (in EUR),
$X_4$: Presence of a sales assistant (in hours per period).


Sales Price Advert. Ass. Hours 


1 230 125 200 109 
2 181 99 59 107 
3 165 97 105 98 
4 150 115 85 71 
5 97 120 0 82 
6 192 100 150 103 
7 181 80 85 111 
8 189 90 120 93 
9 172 95 110 86 
10 170 125 130 78 
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B.5 U.S. Companies Data 


The data set consists of measurements for 79 U.S. companies. The abbreviations in Table B.5 
are as follows: 


$X_1$: A Assets (USD),
$X_2$: S Sales (USD),
$X_3$: MV Market Value (USD),
$X_4$: P Profits (USD),
$X_5$: CF Cash Flow (USD),
$X_6$: E Employees.


Company A S MV P CF E Sector
Bell Atlantic 19788 9084 10636 1092.9 2576.8 79.4 Communication
Continental Telecom 5074 2557 1892 239.9 578.3 21.9 Communication
American Electric Power 13621 4848 4572 485.0 898.9 23.4 Energy
Brooklyn Union Gas 1117 1038 478 59.7 91.7 3.8 Energy
Central Illinois Publ. Serv. 1633 701 679 74.3 135.9 2.8 Energy
Cleveland Electric Illum. 5651 1254 2002 310.7 407.9 6.2 Energy
Columbia Gas System 5835 4053 1601 -93.8 173.8 10.8 Energy
Florida Progress 3494 1653 1442 160.9 320.3 6.4 Energy
Idaho Power 1654 451 779 84.8 130.4 1.6 Energy
Kansas Power & Light 1679 1354 687 93.8 154.6 4.6 Energy
Mesa Petroleum 1257 355 181 167.5 304.0 0.6 Energy
Montana Power 1743 597 717 121.6 172.4 3.5 Energy
Peoples Energy 1440 1617 639 81.7 126.4 3.5 Energy
Phillips Petroleum 14045 15636 2754 418.0 1462.0 27.3 Energy
Publ. Serv. Co of New Mexico 3010 749 1120 146.3 209.2 3.4 Energy
San Diego Gas & Electric 3086 1739 1507 202.7 335.2 4.9 Energy
Valero Energy 1995 2662 341 34.7 100.7 2.3 Energy
American Savings Bank FSB 3614 367 90 14.1 24.6 1.1 Finance
Bank South 2788 271 304 23.5 28.9 2.1 Finance
H & R Block 327 542 959 54.1 72.5 2.8 Finance
California First Bank 5401 550 376 25.6 37.5 4.1 Finance
Cigna 44736 16197 4653 -732.5 -651.9 48.5 Finance
Dreyfus 401 176 1084 55.6 57.0 0.7 Finance
First American 4789 453 367 40.2 51.4 3.0 Finance
First Empire State 2548 264 181 22.2 26.2 2.1 Finance
First Tennessee National 5249 527 346 37.8 56.2 4.1 Finance
Marine Corp 3720 356 211 26.6 34.8 2.4 Finance
Mellon Bank 33406 3222 1413 201.7 246.7 15.8 Finance
National City 12505 1302 702 108.4 131.4 9.0 Finance
Norstar Bancorp 8998 882 988 93.0 119.0 7.4 Finance
Norwest 21419 2516 930 107.6 164.7 15.6 Finance
Southeast Banking 11052 1097 606 64.9 97.6 7.0 Finance
Sovran Financial 9672 1037 829 92.6 118.2 8.2 Finance
United Financial Group 4989 518 53 -3.1 -0.3 0.8 Finance
Apple Computer 1022 1754 1370 72.0 119.0 4.8 HiTech
Digital Equipment 6914 7029 7957 400.6 754.7 87.3 HiTech
EG&G 430 1155 1045 55.7 70.8 22.5 HiTech
General Electric 26432 28285 33172 2336.0 3562.0 304.0 HiTech
Hewlett-Packard 5769 6571 9462 482.0 792.0 83.0 HiTech
IBM 52634 50056 95697 6555.0 9874.0 400.2 HiTech
NCR 3940 4317 3940 315.2 566.3 62.0 HiTech
Telex 478 672 866 67.1 101.6 5.4 HiTech
Armstrong World Industries 1093 1679 1070 100.9 164.5 20.8 Manufacturing
CBI Industries 1128 1516 430 -47.0 26.7 13.2 Manufacturing
Fruehauf 1804 2564 483 70.5 164.9 26.6 Manufacturing
Halliburton 4662 4781 2988 28.7 371.5 66.2 Manufacturing
LTV 6307 8199 598 -771.5 -524.3 57.5 Manufacturing
Owens-Corning Fiberglas 2366 3305 1117 131.2 256.5 25.2 Manufacturing
PPG Industries 4084 4346 3023 302.7 521.7 37.5 Manufacturing
Textron 10348 5721 1915 223.6 322.5 49.5 Manufacturing
Turner 752 2149 101 11.1 15.2 2.6 Manufacturing
United Technologies 10528 14992 5377 312.7 710.7 184.8 Manufacturing
Commun. Psychiatric Centers 278 205 853 44.8 50.5 3.8 Medical
Hospital Corp of America 6259 4152 3090 283.7 524.5 62.0 Medical
AH Robins 707 706 275 61.4 77.8 6.1 Medical
Shared Medical Systems 252 312 883 41.7 60.6 3.3 Medical
Air Products 2687 1870 1890 145.7 352.2 18.2 Other
Allied Signal 13271 9115 8190 -279.0 83.0 143.8 Other
Bally Manufacturing 1529 1295 444 25.6 137.0 19.4 Other
Crown Cork & Seal 866 1487 944 71.7 115.4 12.6 Other
Ex-Cell-O 799 1140 633 57.6 89.2 15.4 Other
Liz Claiborne 223 557 1040 60.6 63.7 1.9 Other
Warner Communications 2286 2235 2306 195.3 219.0 8.0 Other
Dayton-Hudson 4418 8793 4459 283.6 456.5 128.0 Retail
Dillard Department Stores 862 1601 1093 66.9 106.8 16.0 Retail
Giant Food 623 2247 797 57.0 93.8 18.6 Retail
Great A & P Tea 1608 6615 829 56.1 134.0 65.0 Retail
Kroger 4178 17124 2091 180.8 390.4 164.6 Retail
May Department Stores 3442 5080 2673 235.4 361.5 77.3 Retail
Stop & Shop Cos 1112 3689 542 30.3 96.9 43.5 Retail
Supermarkets General 1104 5123 910 63.7 133.3 48.5 Retail
Wickes Cos 2957 2806 457 40.6 93.5 50.0 Retail
FW Woolworth 2535 5958 1921 177.0 288.0 118.1 Retail
AMR 6425 6131 2448 345.8 682.5 49.5 Transportation
IU International 999 1878 393 -173.5 -108.1 23.3 Transportation
PanAm 2448 3484 1036 48.8 257.1 25.4 Transportation
Republic Airlines 1286 1734 361 69.2 145.7 14.3 Transportation
TWA 2769 3725 663 -208.4 12.4 29.1 Transportation
Western AirLines 952 1307 309 35.4 92.8 10.3 Transportation
B.6 French Food Data


The data set consists of the average expenditures on food for several different types of 
families in France (manual workers = MA, employees = EM, managers = CA) with different 
numbers of children (2,3,4 or 5 children). The data is taken from Lebart, Morineau and 
Fénelon (1982). 


bread vegetables fruits meat poultry milk wine
1 MA2 332 428 304 1437 526 247 427
2 EM2 293 559 388 1527 567 239 258
3 CA2 372 767 562 1948 927 239 433
4 MA3 406 563 341 1507 544 324 407
5 EM3 386 608 396 1501 598 319 363
6 CA3 438 843 689 2345 1148 243 341
7 MA4 534 660 367 1620 638 414 407
8 EM4 460 699 484 1856 762 400 416
9 CA4 385 789 621 2366 1149 304 282
10 MA5 655 776 423 1848 759 495 486
11 EM5 584 995 548 2056 893 518 319
12 CA5 515 1097 887 2630 1167 561 284
Mean 446.7 737.8 505.0 1886.7 803.2 358.2 368.6
Std. dev. 102.6 172.2 158.1 378.9 238.9 112.1 68.7
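
The last two rows of the table are summary statistics per column. A minimal sketch for the bread column, assuming the second summary row is a standard deviation computed with divisor n = 12 (an assumption that the printed values 446.7 and 102.6 are consistent with), could look like this:

```python
from statistics import fmean, pstdev

# Bread expenditures of the 12 family types, copied from the table above.
bread = [332, 293, 372, 406, 386, 438, 534, 460, 385, 655, 584, 515]

mean_bread = fmean(bread)
# pstdev is the "population" form: it divides by n, not by n - 1.
sd_bread = pstdev(bread)

print(round(mean_bread, 1), round(sd_bread, 1))
```

Using `statistics.stdev` (divisor n - 1) instead would give a noticeably larger value for this column, which is why the divisor matters when comparing against the printed row.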
B.7 Car Marks


The data are averaged marks for 24 car types from a sample of 40 persons. The marks range 
from 1 (very good) to 6 (very bad) like German school marks. The variables are: 


A Economy,
B Service,
C Non-depreciation of value,
D Price (mark 1 for very cheap cars),
E Design,
F Sporty car,
G Safety,
H Easy handling.

Type Model Economy Service Value Price Design Sport. Safety Easy h. 
Audi 100 3.9 2.8 2.2 4.2 3.0 3.1 2.4 2.8
BMW 5 series 4.8 1.6 1.9 5.0 2.0 2.5 1.6 2.8 
Citroen AX 3.0 3.8 3.8 2.6 4.0 4.4 4.0 2.6
Ferrari 5.3 2.9 2.2 5.9 1.7 1.1 3.3 4.3 
Fiat Uno 2.1 3.9 4.0 2.6 4.5 4.4 4.4 2.2 
Ford Fiesta 2.3 3.1 3.4 2.6 3.2 3.3 3.6 2.8 
Hyundai 2.5 3.4 3.2 2.2 3.3 3.3 3.3 2.4 
Jaguar 4.6 2.4 1.6 5.5 1.3 1.6 2.8 3.6 
Lada Samara 3.2 3.9 4.3 2.0 4.3 4.5 4.7 2.9 
Mazda 323 2.6 3.3 3.7 2.8 3.7 3.0 3.7 3.1 
Mercedes 200 4.1 1.7 1.8 4.6 2.4 3.2 1.4 2.4 
Mitsubishi Galant 3.2 2.9 3.2 3.5 3.1 3.1 2.9 2.6 
Nissan Sunny 2.6 3.3 3.9 2.1 3.5 3.9 3.8 2.4 
Opel Corsa 2.2 2.4 3.0 2.6 3.2 4.0 2.9 2.4
Opel Vectra 3.1 2.6 2.3 3.6 2.8 2.9 2.4 2.4 
Peugeot 306 2.9 3.5 3.6 2.8 3.2 3.8 3.2 2.6 
Renault 19 2.7 3.3 3.4 3.0 3.1 3.4 3.0 2.7 
Rover 3.9 2.8 2.6 4.0 2.6 3.0 3.2 3.0 
Toyota Corolla 2.5 2.9 3.4 3.0 3.2 3.1 3.2 2.8 
Volvo 3.8 2.3 1.9 4.2 3.1 3.6 1.6 2.4 
Trabant 601 3.6 4.7 5.5 1.5 4.1 5.8 5.9 3.1 
VW Golf 2.4 2.1 2.0 2.6 3.2 3.1 3.1 1.6 
VW Passat 3.1 2.2 2.1 3.2 3.5 3.5 2.8 1.8 
Wartburg 1.3 3.7 4.7 5.5 1.7 4.8 5.2 5.5 4.0 
B.8 French Baccalauréat Frequencies


The data consist of observations of 202100 baccalauréats from France in 1976 and give
the frequencies for different sets of modalities classified into regions. For a reference see
Bourouche and Saporta (1980). The variables (modalities) are:


X1: A Philosophy-Letters,
X2: B Economics and Social Sciences,
X3: C Mathematics and Physics,
X4: D Mathematics and Natural Sciences,
X5: E Mathematics and Techniques,
X6: F Industrial Techniques,
X7: G Economic Techniques,
X8: H Computer Techniques.

Abbrev. Region A B C D E F G H total
ILDF Île-de-France 9724 5650 8679 9432 839 3303 5300 83 43115
CHAM Champagne-Ardennes 924 464 567 984 132 423 736 12 4242
PICA Picardie 1081 490 830 1222 118 410 743 13 4907
HNOR Haute-Normandie 1135 587 686 904 83 629 813 13 4850
CENT Centre 1482 ««667,— «1020-1535. s«173,s«29—«i8DsG—~*«CH NL 
BNOR Basse-Normandie 1033 509 553 1063 100 433 742 13 4446
BOUR Bourgogne 1272 527 861 1116 219 769 1232 13 6009
NOPC Nord - Pas-de-Calais 2549 1141 2164 2752 587 1660 1951 41 12845
LORR Lorraine 1828 681 1364 1741 302 1289 1683 15 8903
ALSA Alsace 1076 «=6.443—(iws880—«id22.'s—sd145Ss«S917—Ss—«a209s'i‘ia2S COBB 
FRAC Franche-Comté 827 333 481 892 137 451 618 18 3757
PAYL Pays de la Loire 2213 809 1439 2623 269 990 1783 14 10140
BRET Bretagne 2158 1271 1633 2352 350 950 1509 22 10245
PCHA Poitou-Charentes 1358 503 639 1377 164 495 959 10 5505
AQUI Aquitaine 2757 873 1466 2296 215 789 1459 17 9872
MIDI Midi-Pyrénées 2493 1120 1494 2329 254 855 1565 28 10138
LIMO Limousin 551 297 386 663 67 334 378 12 2688
RHOA Rhône-Alpes 3951 2127 3218 4743 545 2072 3018 36 19710
AUVE Auvergne 1066 579 724 1239 126 476 649 12 4871
LARO Languedoc-Roussillon 1844 816 1154 1839 156 469 993 16 7287
PROV Provence-Alpes-Côte d’Azur 3944 1645 2415 3616 343 1236 2404 22 15625
CORS Corse 327 31 85 178 9 27 79 0 736

total 45593 21563 32738 46017 5333 19656 30749 451 202100
B.9 Journaux Data


This data set was created from a survey conducted in Belgium in the 1980s
on people’s reading habits. Respondents were asked where they live (10 regions comprised
of 7 provinces and 3 regions around Brussels) and what kind of newspaper they read on a
regular basis. The 15 possible answers belong to 3 classes: Flemish newspapers (first letter
v), French newspapers (first letter f) and both languages (first letter b).


X1: WaBr Walloon Brabant
X2: Brar Brussels area
X3: Antw Antwerp
X4: FlBr Flemish Brabant
X5: OcFl Occidental Flanders
X6: OrFl Oriental Flanders
X7: Hain Hainaut
X8: Lièg Liège
X9: Limb Limburg
X10: Luxe Luxembourg

WaBr Brar Antw FlBr OcFl OrFl Hain Lièg Limb Luxe
Va 1.8 7.8 9.1 3.0 4.3 3.9 0.1 0.3 3.3 0.0 
B.10 U.S. Crime Data


This is a data set consisting of 50 measurements of 7 variables. It states for one year (1985)
the reported number of crimes in the 50 states of the U.S. classified according to 7 categories
(X3–X9).


X1: land area (land)
X2: population 1985 (popu 1985)
X3: murder (murd)
X4: rape
X5: robbery (robb)
X6: assault (assa)
X7: burglary (burg)
X8: larceny (larc)
X9: auto theft (auto)
X10: U.S. states region number (reg)
X11: U.S. states division number (div)


division numbers | region numbers
New England 1 | Northeast 1
Mid Atlantic 2 | Midwest 2
E N Central 3 | South 3
W N Central 4 | West 4
S Atlantic 5 |
E S Central 6 |
W S Central 7 |
Mountain 8 |
Pacific 9 |

state land area popu 1985 murd rape robb assa burg larc auto reg div
ME 33265 1164 1.5 7.0 12.6 62 562 1055 146 1 1
NH 9279 998 2.0 6 12.1 36 566 929 172 1 1 
VT 9614 535 1.3 10.3 7.6 55 731 969 124 1 1 
MA 8284 5822 3.5 12.0 99.5 88 1134 1531 878 1 1 
RI 1212 968 3.2 3.6 78.3 120 1019 2186 859 1 1
CT 5018 3174 3.5 9.1 70.4 87 1084 1751 484 1 1 
NY 49108 17783 7.9 15.5 443.3 209 1414 2025 682 1 2 
NJ 7787 7562 5.7 12.9 169.4 90 1041 1689 557 1 2 
PA 45308 11853 5.3 11.3 106.0 90 594 11 340 1 2 
OH 41330 10744 6.6 16.0 145.9 116 854 1944 493 2 3 
IN 36185 5499 4.8 17.9 107.5 95 860 1791 429 2 3 
IL 56345 11535 9.6 20.4 251.1 187 765 2028 518 2 3 
MI 58527 9088 94 27.1 346.6 193 1571 2897 464 2 3 
WI 56153 4775 2.0 6.7 33.1 44 539 1860 218 2 3 
MN 84402 4193 2.0 9.7 89.1 51 802 1902 346 2 4 
IA 56275 2884 1.9 6.2 28.6 48 507 1743 175 2 4 
MO 69697 5029 10.7 27.4 2.8 167 1187 2074 538 2 4 
ND 70703 685 0.5 6.2 6.5 21 286 1295 91 2 4 
SD 77116 708 3.8 11.1 17.1 60 471 1396 94 2 4
NE 77355 1606 3.0 9.3 57.3 115 505 1572 292 2 4 
KS 82277 2450 4.8 14.5 75.1 108 882 2302 257 2 4 
DE 2044 622 7.7 18.6 105.5 196 1056 2320 559 3 5 
MD 10460 4392
VA 40767 5706 
WV 24231 =: 1986 
NC 52669 = 6255 
SC 3111333347 
GA 58910 5976 
FL 58664 11366 
KY 40409 3726 
TN 42144 4762 
AL 51705 4021 
MS 47689 = _2613 
AR 53187 = 2359 
LA 47751 = 4481 
OK 69956 = 33801 
TX 266807 16370 
MT 147046 826 
ID 83564 15 
WY = 97809 509 
CO 104091 =. 3231 
NM 121593 =: 1450 
AZ 1140 =. 33187 
UT 384899 =: 1645 
NV 110561 936 
WA 68138 4409 
OR = 97073 ~—- 2687 
CA 158706 26365 
AK 5914 521 
HI 6471 1054 


338.6 253 1051 2417 548 
92.0 143 806 1980 297 
27.3 84 389774 92 
53.0 293 766 1338 169 
60.1 193 1025 1509 256 
95.8 177 9 1869 309 

186.1 277 1562 2861 397 
72.8 123 704 1212 346 
82.0 169 807 1025 289 
50.3 215 763 1125 223 
19.0 140 351 694 78 
456 150 885 1211 109 

140.8 238 890 1628 385 
54.9 127 841 1661 280 

134.1 195 1151 2183 394 
22.3 75 594 1956 222 
20.5 86 674 2214 144 
22.0 73 646 2049 165 

129.1 185 1381 2992 588 
66.1 196 1142 2408 392 

120.2 214 1493 3550 ~=501 
53.1 70 915 2833 316 

188.4 182 1661 3044 661 
93.5 106 1441 2853 362 

102.5 1382 1273 2825 333 

206.9 226 1753 3422 689 
71.8 168 790 2183 551 
63.3 43 1456 3106 581 
B.11 Plasma Data


In Olkin and Veath (1980), the evolution of citrate concentration in the plasma is observed
at 3 different times of day, X1 (8 am), X2 (11 am) and X3 (3 pm), for two groups of patients.
Each group follows a different diet.


X1: 8 am
X2: 11 am
X3: 3 pm


Group (8 am) (11 am) (3 pm)
I 125 137 121
I 144 173 147
I 105 119 125
I 151 149 128
I 137 139 109
II 93 121 107
II 116 135 106
II 109 83 100
II 89 95 83
II 116 128 100
B.12 WAIS Data


Morrison (1990) compares the results of 4 subtests of the Wechsler Adult Intelligence Scale
(WAIS) for 2 categories of people: group 1 consists of n1 = 37 people who do not present a senile
factor, and group 2 of those (n2 = 12) presenting a senile factor.


WAIS subtests:


X1: information
X2: similarities
X3: arithmetic
X4: picture completion

Group II 
subject information similarities arithmetic picture completion 
1 9 5 10 8 
2 10 0 6 2 
3 8 9 11 1 
4 13 7 14 9 
5 4 0 4 0 
6 4 0 6 0 
7 11 9 9 8 
8 5 3 3 6 
9 9 7 8 6
10 7 2 6 4 
11 12 10 14 3 
12 13 12 11 10 
Mean 8.75 5.33 8.50 4.75 
Group I 
subject information similarities arithmetic picture completion 
1 7 5 9 8 
2 8 8 5 6 
3 16 18 11 9 
4 8 3 7 9 
5 6 3 13 9 
6 11 8 10 10 
7 12 7 9 8 
8 8 11 9 3 
9 14 12 11 4 
10 13 13 13 6 
11 13 9 9 9 
12 13 10 15 7 
13 14 11 12 8 
14 15 11 11 10 
15 13 10 15 9 
16 10 5 8 6 
17 10 3 7 7 
18 17 13 13 7 
19 10 6 10 7 
20 10 10 15 8
21 14 7 11 5 
22 16 11 12 11 
23 10 7 14 6 
24 10 10 9 6 
25 10 7 10 10 
26 7 6 5 9 
27 15 12 10 6 
28 17 15 15 8
29 16 13 16 9 
30 13 10 17 8 
31 13 10 17 10
32 19 12 16 10 
33 19 15 17 11 
34 13 10 7 8 
35 15 11 12 8
36 16 9 11 11 
37 14 13 14 9 


Mean 12.57 9.57 11.49 7.97 
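
The "Mean" rows can be checked directly from the subject scores. The sketch below does this for Group II (n2 = 12). The similarities score of subject 9 is taken as 7 here, which is an assumption: it is the value consistent with the printed column mean of 5.33. The dictionary layout and names are illustrative only.

```python
# Group II scores per subtest, subjects 1..12, copied from the Group II table.
# Assumption: subject 9's similarities score is entered as 7.
group2 = {
    "information":        [9, 10, 8, 13, 4, 4, 11, 5, 9, 7, 12, 13],
    "similarities":       [5, 0, 9, 7, 0, 0, 9, 3, 7, 2, 10, 12],
    "arithmetic":         [10, 6, 11, 14, 4, 6, 9, 3, 8, 6, 14, 11],
    "picture completion": [8, 2, 1, 9, 0, 0, 8, 6, 6, 4, 3, 10],
}
printed_means = {
    "information": 8.75,
    "similarities": 5.33,
    "arithmetic": 8.50,
    "picture completion": 4.75,
}

# Column means recomputed from the raw scores.
computed_means = {name: sum(scores) / len(scores) for name, scores in group2.items()}

for name, mean in computed_means.items():
    # The printed means are rounded to two decimals, so allow half a unit
    # in the last printed digit when comparing.
    agrees = abs(mean - printed_means[name]) <= 0.005
    print(f"{name}: computed {mean:.2f}, printed {printed_means[name]:.2f}, agrees: {agrees}")
```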



B.13 ANOVA Data


The yields of wheat have been measured in 30 parcels which have been randomly attributed 
to 3 lots prepared by one of 3 different fertilizers A, B, and C. 


X1: fertilizer A
X2: fertilizer B
X3: fertilizer C


Fertilizer A 
Yield 


B.14 Timebudget Data


In Volle (1985), we can find data on 28 individuals identified according to sex, country where
they live, professional activity and matrimonial status. The data indicate the amount of time
each person spent on ten categories of activities over 100 days (100 · 24 h = 2400 hours total
in each row) in the year 1976.


X1: prof: professional activity
X2: tran: transportation linked to professional activity
X3: hous: household occupation
X4: kids: occupation linked to children
X5: shop: shopping
X6: pers: time spent for personal care
X7: eat: eating
X8: slee: sleeping
X9: tele: watching television
X10: leis: other leisures

maus: active men in the U.S.
waus: active women in the U.S.
wnus: nonactive women in the U.S.
mmus: married men in U.S.
wmus: married women in U.S.
msus: single men in U.S.
wsus: single women in U.S.
mawe: active men from Western countries
wawe: active women from Western countries
wnwe: nonactive women from Western countries
mmwe: married men from Western countries
wmwe: married women from Western countries
mswe: single men from Western countries
wswe: single women from Western countries
mayo: active men from Yugoslavia
wayo: active women from Yugoslavia
wnyo: nonactive women from Yugoslavia
mmyo: married men from Yugoslavia
wmyo: married women from Yugoslavia
msyo: single men from Yugoslavia
wsyo: single women from Yugoslavia
maes: active men from eastern countries
waes: active women from eastern countries
wnes: nonactive women from eastern countries
mmes: married men from eastern countries
wmes: married women from eastern countries
mses: single men from eastern countries
wses: single women from eastern countries

prof tran hous kids shop pers eat slee tele leis
maus 610 140 60 10 120 95 115 760 175 315
waus 475 90 250 30 140 120 100 775 115 305
wnus 10 0 495 110 170 110 130 785 160 430
mmus 615 140 65 10 115 90 115 765 180 305
wmus 179 29 421 87 161 112 119 776 143 373
msus 585 115 50 0 150 105 100 760 150 385
wsus 482 94 196 18 141 130 96 775 132 336
mawe 653 100 95 7 57 85 150 808 115 330
wawe 511 70 307 30 80 95 142 816 87 262
wnwe 20 7 568 87 112 90 180 843 125 368
mmwe 656 97 97 10 52 85 152 808 122 321
wmwe 168 22 528 69 102 83 174 824 119 311
mswe 643 105 72 0 62 77 140 813 100 388
wswe 429 34 262 14 92 97 147 849 84 392
mayo 650 140 120 15 85 90 105 760 70 365
wayo 560 105 375 45 90 90 95 745 60 235
wnyo 10 10 710 55 145 85 130 815 60 380
mmyo 650 145 112 15 85 90 105 760 80 358
wmyo 260 52 576 59 116 85 117 775 65 295
msyo 615 125 95 0 115 90 85 760 40 475
wsyo 433 89 318 23 112 96 102 774 45 408
maea 650 142 122 22 76 94 100 764 96 334
waea 578 106 338 42 106 94 92 752 64 228
wnea 24 8 594 72 158 92 128 840 86 398
mmea 652 133 134 22 68 94 102 763 122 310
wmea 436 79 433 60 119 90 107 772 73 231
msea 627 148 68 0 88 92 86 770 58 463
wsea 434 86 297 21 129 102 94 799 58 380
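
Because every row covers 100 days of 24 hours, the ten entries of a row must add up to 2400. This constraint gives a simple way to check a row or to derive one entry from the other nine. The sketch below treats the kids entry of mawe and the hous entry of wmwe as unknown and derives them from the row total; the helper name `fill_missing` is hypothetical and the other values are copied from the table above.

```python
ROW_TOTAL = 100 * 24  # 100 days of 24 hours each = 2400 hours per row

# Column order: prof tran hous kids shop pers eat slee tele leis.
# None marks the single entry to be derived from the row total.
rows = {
    "maus": [610, 140, 60, 10, 120, 95, 115, 760, 175, 315],
    "mawe": [653, 100, 95, None, 57, 85, 150, 808, 115, 330],
    "wmwe": [168, 22, None, 69, 102, 83, 174, 824, 119, 311],
}


def fill_missing(values, total=ROW_TOTAL):
    """Return a copy of the row with its single None replaced so that the row sums to total."""
    missing = [i for i, v in enumerate(values) if v is None]
    if len(missing) > 1:
        raise ValueError("at most one missing entry can be recovered per row")
    filled = list(values)
    if missing:
        filled[missing[0]] = total - sum(v for v in values if v is not None)
    return filled


completed = {name: fill_missing(values) for name, values in rows.items()}
for name, values in completed.items():
    print(name, values, "row sum:", sum(values))
```

A row with no missing entry, such as maus, passes through unchanged, so the same loop doubles as a consistency check on complete rows.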
B.15 Geopol Data


This data set contains a comparison of 41 countries according to 10 different political and 
economic parameters. 


X1: popu population
X2: giph Gross Internal Product per inhabitant
X3: ripo rate of increase of the population
X4: rupo rate of urban population
X5: rlpo rate of illiteracy in the population
X6: rspo rate of students in the population
X7: eltp expected lifetime of people
X8: rnnr rate of nutritional needs realized
X9: nunh number of newspapers and magazines per 1000 inhabitants
X10: nuth number of televisions per 1000 inhabitants

AFS South Africa    DAN Denmark      MAR Morocco
ALG Algeria         EGY Egypt        MEX Mexico
BRD Germany         ESP Spain        NOR Norway
GBR Great Britain   FRA France       PER Peru
ARS Saudi Arabia    GAB Gabon        POL Poland
ARG Argentina       GRE Greece       POR Portugal
AUS Australia       HOK Hong Kong    SUE Sweden
AUT Austria         HON Hungary      SUI Switzerland
BEL Belgium         IND India        THA Thailand
CAM Cameroon        IDO Indonesia    URS USSR
CAN Canada          ISR Israel       USA USA
CHL Chile           ITA Italy        VEN Venezuela
CHN China           JAP Japan        YOU Yugoslavia
CUB Cuba            KEN Kenya

popu giph ripo rupo rlpo rspo eltp rnnr nunh nuth

AFS 37 2492 2 58.9 44 1.08 60 120 48 98 
ALG 24.6 1960 3 44.7 50.4 0.73 64 112 21 71 
BRD 62 19610 04 86.4 2. “202 72 = 145 585 759 
GBR 57.02 14575 0.04 92.5 2.2 1.9 75 128 421 435 
ARS 14.4 5980 2.7 77.3 48.9 0.91 63 125 34 269 
ARG 32.4 2130 1.6 86.2 6.1 2.96 71 136 82 217 
AUS 16.81 16830 14 85.5 5) 2.5 76 =125 252 484 
AUT 7.61 16693 O 57.7 1.5 2.52 74 130 362 487 
BEL 9.938 152438 0.2 96.9 3 2.56 74 150 219 320 
CAM 11 1120. 2.7 49.4 58.8 0.17 53 88 6 12 
CAN 26.25 20780 0.9 76.4 1 6.89 77 ~=129 321 586 
CHL 12.95 1794 16 856 89 1.73 71 106 67 183 
CHN
CUB 
DAN 
EGY 
ESP 
FRA 
GAB 
GRE 
HOK 
HON 
IND 
IDO 
ISR 
ITA 
JAP 
KEN 
MAR 
MEX 
NOR 
PER 
POL 
POR 
SUE 
SUI 
THA 
URS 
USA 
VEN 
YOU 


1119 
10.5 
5.13 

52.52 
39.24 
56.1 
1.1 
10 

5.75 

10.6 
810 
179 

4.47 

57.59 
123.2 
23.88 
24.51 
84.3 
4.2 
21.75 
38 
10.5 

8.47 

6.7 
55.45 
289 
247.5 
19.2 
23.67 


21.4 
74.9 
86.4 
48.8 
78.4 
74.1 
45.7 
62.60 
100 
60.3 
28 
28.8 
91.6 
68.6 
77 
23.6 
48.5 
72.6 
74.4 
70.2 
63.2 
33.3 
84 
59.6 
22.6 
67.5 
74 
90 
50.2 


34.5 


0.16 
2.38 
2.38 
1.67 
2.59 
2.63 
0.36 
1.89 
1.34 
0.93 
0.55 
0.55 
2.62 
2.25 

2.1 
0.11 
0.86 
1.55 
2.74 
2.04 

1.3 
1.99 
2.21 
1.87 
1.59 
1.76 
5.01 

2.6 
1.44 


69 
75 
75 
59 
77 
76 
52 
76 
77 
70 
57 
60 
75 
75 
78 
58 
61 
68 
77 
61 
71 
74 
77 
77 
65 
69 
75 
69 
72 


111 
135 
131 
132 
137 
130 
107 
147 
121 
135 
100 
116 
118 
139 
122 

92 
118 
120 
124 

93 
134 
128 
113 
128 
105 
133 
138 
102 
139 


36 
129 
309 

39 

79 
193 

14 
102 
521 
273 

28 

21 
253 
105 
566 

13 

12 
124 
5ol 

3l 
184 

70 
526 
504 

46 
A474 
259 
164 
100 
B.16 U.S. Health Data


This is a data set consisting of 50 measurements of 13 variables. It states for one year (1985) 
the reported number of deaths in the 50 states of the U.S. classified according to 7 categories. 


X1: land area (land)
X2: population 1985 (popu)
X3: accident (acc)
X4: cardiovascular (card)
X5: cancer (canc)
X6: pulmonary (pul)
X7: pneumonia flu (pneu)
X8: diabetes (diab)
X9: liver (liv)
X10: Doctors (doc)
X11: Hospitals (hosp)
X12: U.S. states region number (r)
X13: U.S. states division number (d)


division numbers | region numbers
New England 1 | Northeast 1
Mid Atlantic 2 | Midwest 2
E N Central 3 | South 3
W N Central 4 | West 4
S Atlantic 5 |
E S Central 6 |
W S Central 7 |
Mountain 8 |
Pacific 9 |

state land popu acc card canc pul pneu diab liv doc hosp r d


ME 332651164. 337.7 466.2) 213.8 336 21.1 156 145 1773 47 
NH 9279 998 35.9 395.9 182.2 296 20.1 17.6 104 1612 34 
VT 9614 535 41.3 433.1 188.1 33.1 240 156 13.1 1154 19 
MA 8284 5822 31.1 460.6 219 249 29.7 16.0 13.0 16442 177 

RI 1212 968 28.6 474.1 231.5 274 17.7 26.2 13.4 2020 21 
CT 5018 93174 «935.38 423.8 205.1 23.2 22.4 15.4 11.7 8076 65 
NY 49108 17783 31.5 499.5 209.9 23.9 26.0 17.1 17.7 49304 338 
NJ 7787) = 7562) 332.2 464.7 216.3 23.3 19.9 17.3 14.2 15120 131 
PA 45308 11853 34.9 508.7 223.6 27.0 20.1 20.4 12.0 23695 307 
OH 41330 10744 33.2 443.1 198.8 27.4 180 189 10.2 18518 236 


WNWNNNNNNNNNNKRPERHEEHRRB RHE H 
PER EEK BWWWWWNNNRRRREH 


IN 36185 5499 37.7 435.7 184.6 27.2 18.6 17.2 84 7339 133 
IL 56345 11585 32.9 449.6 193.2 22.9 21.3 15.3 12.5 22173 279 
MI 58527 9088 34.3 420.9 182.3 24.2 18.7 148 13.7 15212 231 
WI (56153) 4775) 333.8 «444.3 189.4 22.5 21.2 15.7 8.7 7899 163 
MN 84402 4193 35.7 398.3 174 23.4 256 13.5 81 8098 181 
TA 56275 2884 38.6 490.1 199.1 31.2 283 166 7.9 3842 140 
MO 69697 5029 42.2 475.9 211.1 29.8 25.7 15.3 96 8422 169 
ND _— 70703 685 48.2 401 173.7 18.2 25.9 149 7.4 936 58 
SD 77116 708 53.0 495.2 182.1 30.7 32.4 12.8 7.2 833 68 
NE 77355 1606 40.8 479.6 187.4 316 283 135 7.8 2394 110 
KS = 82277) = 2450 «42.9 455.9 183.9 32.3 249 169 7.8 3801 165 


2044 
10460 
40767 
24231 
52669 
31113 
58910 
58664 
40409 
42144 
51705 
47689 
53187 
47751 
69956 
266807 
147046 
83564 
97809 
104091 
121593 
1140 
84899 
110561 
68138 
97073 
158706 
5914 

6471 


622 
4392 
5706 
1936 
6255 
3347 
5976 


11366 


3726 
4762 
4021 
2613 
2359 
4481 
3301 


16370 


826 
15.0 
509 
3231 
1450 
3187 
1645 
936 
4409 
2687 


26365 


521 
1054 


38.8 
35.2 
37.4 
46.7 
45.4 
47.8 
48.2 
46.0 
48.8 
45.0 
48.9 
59.3 
51.0 
52.3 
62.5 
48.9 
59.0 
51.5 
67.6 
44.7 
62.3 
48.3 
39.3 
57.3 
41.4 
41.6 
40.3 
89.8 
32.9 


404.5 
366.7 
365.3 
502.7 
392.6 
374.4 
371.4 
501.8 
442.5 
427.2 
411.5 
422.3 

482 
390.9 
441.4 
327.9 
372.2 
324.8 
264.2 
280.2 
235.6 
331.5 

242 
299.5 
398.1 
387.8 
307.8 
114.6 
216.9 


16.0 
15.8 
20.3 
20.1 
19.8 
19.2 
20.5 
18.3 
22.9 
20.8 
16.8 
19.5 
22.7 
15.8 
24.5 
17.4 
25.1 
22.3 
18.5 
22.8 
17.8 
21.2 
14.5 
13.7 
21.2 
23.1 
22.2 
12.4 
16.8 
1046
11961 
9749 
2813 
9355 
4359 
8256 
18836 
5189 
7572 
5157 
2883 
2952 
7061 
4128 
23481 
1058 
1079 
606 
5899 
2127 
5137 
2963 
1272 
7768 
4904 
57225 
545 
1953 
B.17 Vocabulary Data


This example of the evolution of the vocabulary of children can be found in Bock (1975). 
Data are drawn from test results on file in the Records Office of the Laboratory School of 
the University of Chicago. They consist of scores, obtained from a cohort of pupils from the 
eighth through eleventh grade levels, on alternative forms of the vocabulary section of the 
Cooperative Reading Test. It provides the following scaled scores shown for the sample of
64 subjects (the origin and units are fixed arbitrarily). 


Grade
Subjects 8 9 10 11 Mean
1 1.75 2.60 3.76 3.68 2.95
2 0.90 2.47 2.44 3.43 2.31
3 0.80 0.93 0.40 2.27 1.10
4 2.42 4.15 4.56 4.21 3.83
5 -1.31 -1.31 -0.66 -2.22 -1.38
6 -1.56 1.67 0.18 2.33 0.66
7 1.09 1.50 0.52 2.33 1.36
8 -1.92 1.03 0.50 3.04 0.66
9 -1.61 0.29 0.73 3.24 0.66
10 2.47 3.64 2.87 5.38 3.99
11 -0.95 0.41 0.21 1.82 0.37
12 1.66 2.74 2.40 2.17 2.24
13 2.07 4.92 4.46 4.71 4.04
14 3.30 6.10 7.19 7.46 6.02
15 2.75 2.53 4.28 5.93 3.87
16 2.25 3.38 5.79 4.40 3.96
17 2.08 1.74 4.12 3.62 2.89
18 0.14 0.01 1.48 2.78 1.10
19 0.13 3.19 0.60 3.14 1.77
20 2.19 2.65 3.27 2.73 2.71
21 -0.64 -1.31 -0.37 4.09 0.44
22 2.02 3.45 5.32 6.01 4.20
23 2.05 1.80 3.91 2.49 2.56
24 1.48 0.47 3.63 3.88 2.37
25 1.97 2.54 3.26 5.62 3.39
26 1.35 4.63 3.54 5.24 3.69
27 -0.56 -0.36 1.14 1.34 0.39
28 0.26 0.08 1.17 2.15 0.92
29 1.22 1.41 4.66 2.62 2.47
30 -1.43 0.80 -0.03 1.04 0.09
31 -1.17 1.66 2.11 1.42 1.00
32 1.68 1.71 4.07 3.30 2.69
33 -0.47 0.93 1.30 0.76 0.63
34 2.18 6.42 4.64 4.82 4.51
35 4.21 7.08 6.00 5.65 5.73
36 8.26 9.55 10.24 10.58 9.66
37 1.24 4.90 2.42 2.54 2.78
38 5.94 6.56 9.36 7.72 7.40
39 0.87 3.36 2.98 1.73 2.14
40 -0.09 2.29 3.08 3.30 2.15
41 3.24 4.78 3.02 4.84 4.10
42 1.03 2.10 3.88 2.81 2.45
43 3.08 4.67 3.83 5.19 4.32
44 1.41 1.75 3.70 3.77 2.66
45 -0.65 -0.11 2.40 3.03 1.29
46 1.52 3.04 2.74 2.63 2.48
47 0.57 2.71 1.90 2.41 1.90
48 2.18 2.96 4.78 3.34 3.32
49 1.10 2.65 1.72 2.96 2.11
50 0.15 2.69 2.69 3.00 2.26
51 -1.27 1.26 0.71 2.68 0.85
52 2.81 5.19 6.33 5.93 5.06
53 2.62 3.54 4.86 5.80 4.21
54 0.11 2.25 1.56 3.92 1.96
55 0.61 1.14 1.35 0.53 0.91
56 -2.19 -0.42 1.54 1.16 0.02
57 1.55 2.42 1.11 2.18 1.82
58 0.04 0.50 2.60 2.61 1.42
59 3.10 2.00 3.92 3.91 3.24
60 -0.29 2.62 1.60 1.86 1.45
61 2.28 3.39 4.91 3.89 3.62
62 2.57 5.78 5.12 4.98 4.61
63 -2.19 0.71 1.56 2.31 0.60
64 -0.04 2.44 1.79 2.64 1.71
Mean 1.14 2.54 2.99 3.47 2.53
B.18 Athletic Records Data


This data set provides data on athletic records for 55 countries. 


Country 100m (s) 200m (s) 400m (s) 800m (min) 1500m (min) 5000m (min) 10000m (min) Marathon (min)
Argentina 10.39 20.81 46.84 1.81 3.70 14.04 29.36 137.71 
Australia 10.31 20.06 44.84 1.74 3.57 13.28 27.66 128.30 
Austria 10.44 20.81 46.82 1.79 3.60 13.26 27.72 135.90 
Belgium 10.34 20.68 45.04 1.73 3.60 13.22 27.45 129.95 
Bermuda 10.28 20.58 45.91 1.80 3.75 14.68 30.55 146.61 
Brazil 10.22 20.43 45.21 1.73 3.66 13.62 28.62 133.13 
Burma 10.64 21.52 48.30 1.80 3.85 14.45 30.28 139.95 
Canada 10.17 20.22 45.68 1.76 3.63 13.55 28.09 130.15 
Chile 10.34 20.80 46.20 1.79 3.71 13.61 29.30 134.03 
China 10.51 21.04 47.30 1.81 3.73 13.90 29.13 133.53 
Columbia 10.43 21.05 46.10 1.82 3.74 13.49 27.88 131.35 
Cook Is 12.18 23.20 52.94 2.02 4.24 16.70 35.38 164.70
Costa Rica 10.94 21.90 48.66 1.87 3.84 14.03 28.81 136.58 
Czech 10.35 20.65 45.64 1.76 3.98 13.42 28.19 134.32 
Denmark 10.56 20.52 45.89 1.78 3.61 13.50 28.11 130.78 
Dom Rep 10.14 20.65 46.80 1.82 3.82 14.91 31.45 154.12 
Finland 10.43 20.69 45.49 1.74 3.61 13.27 27.52 130.87 
France 10.11 20.38 45.28 1.73 3.57 13.34 27.97 132.30 
GDR 10.12 20.33 44.87 1.73 3.96 13.17 27.42 129.92 
FRG 10.16 20.37 44.50 1.73 3.93 13.21 27.61 132.23 
GB 10.11 20.21 44.93 1.70 3.01 13.01 27.51 129.13 
Greece 10.22 20.71 46.56 1.78 3.64 14.59 28.45 134.60 
Guatemala 10.98 21.82 48.40 1.89 3.80 14.16 30.11 139.33 
Hungary 10.26 20.62 46.02 1.77 3.62 13.49 28.44 132.58 
India 10.60 21.42 45.73 1.76 3.73 13.77 28.81 131.98 
Indonesia 10.59 21.49 47.80 1.84 3.92 14.73 30.79 148.83 
Ireland 10.61 20.96 46.30 1.79 3.96 13.32 27.81 132.35 
Israel 10.71 21.00 47.80 1.77 3.72 13.66 28.93 137.55 
Italy 10.01 19.72 45.26 1.73 3.60 13.23 27.52 131.08 
Japan 10.34 20.81 45.86 1.79 3.64 13.41 27.72 128.63 
Kenya 10.46 20.66 44.92 1.73 3.55 13.10 27.80 129.75 


Korea 10.34 20.89 46.90 1.79 3.77 13.96 29.23 136.25 
P Korea 10.91 21.94 47.30 1.85 3.77 14.13 29.67 130.87
Luxemburg 10.35 20.77 47.40 1.82 3.67 13.64 29.08 141.27 
Malaysia 10.40 20.92 46.30 1.82 3.80 14.64 31.01 154.10 
Mauritius 11.19 33.45 47.70 1.88 3.83 15.06 31.77 152.23 
Mexico 10.42 21.30 46.10 1.80 3.65 13.46 27.95 129.20 
Netherlands 10.52 29.95 45.10 1.74 3.62 13.36 27.61 129.02 
NZ 10.51 20.88 46.10 1.74 3.54 13.21 27.70 128.98 
Norway 10.55 21.16 46.71 1.76 3.62 13.34 27.69 131.48 
Png 10.96 21.78 47.90 1.90 4.01 14.72 31.36 148.22 
Philippines 10.78 21.64 46.24 1.81 3.83 14.74 30.64 145.27 
Poland 10.16 20.24 45.36 1.76 3.60 13.29 27.89 131.58 
Portugal 10.53 21.17 46.70 1.79 3.62 13.13 27.38 128.65 
Rumania 10.41 20.98 45.87 1.76 3.64 13.25 27.67 132.50 
Singapore 10.38 21.28 47.40 1.88 3.89 15.11 31.32 157.77 
Spain 10.42 20.77 45.98 1.76 3.99 13.31 27.73 131.57 
Sweden 10.25 20.61 45.63 1.77 3.61 13.29 27.94 130.63 
Switzerland 10.37 20.45 45.78 1.78 3.99 13.22 27.91 131.20 
Tapei 10.59 21.29 46.80 1.79 3.77 14.07 30.07 139.27 
Thailand 10.39 21.09 47.91 1.83 3.84 15.23 32.56 149.90 
Turkey 10.71 21.43 47.60 1.79 3.67 13.56 28.58 131.50 
USA 9.93 19.75 43.86 1.73 3.93 13.20 27.43 128.22 
USSR 10.07 20.00 44.60 1.75 3.99 13.20 27.53 130.55 
W Samoa 10.82 21.86 49.00 2.02 4.24 16.28 34.71 161.83
B.19 Unemployment Data


This data set provides unemployment rates in all federal states of Germany in September 
1999. 


No. Federal state Unemployment rate 
1 Schleswig-Holstein 8.7 
2 Hamburg 9.8 
3 Mecklenburg-Vorpommern 17.3 
4 Niedersachsen 9.8 
5 Bremen 13.9 
6  Nordrhein-Westfalen 9.8 
7 Hessen 7.9 
8  Rheinland-Pfalz rot 
9 Saarland 10.4 
10 Baden-Württemberg 6.2
11 Bayern 5.8 
12 Berlin 15.8 
13 Brandenburg Tit 
14 Sachsen-Anhalt 19.9
15 Thüringen 15.1
16 Sachsen 16.8 



B.20 Annual Population Data 


The data show yearly average population figures for the old federal states (given in 1000
inhabitants).


Year Inhabitants Unemployed 
1960 55433 271
1961 56158 181 
1962 56837 155 
1963 57389 186 
1964 57971 169 
1965 58619 147 
1966 59148 161 
1967 59268 459 
1968 59500 323 
1969 60067 179 
1970 60651 149 
1971 61302 185 
1972 61672 246 
1973 61976 273
1974 62054 582 
1975 61829 1074 
1976 61531 1060 
1977 61400 1030 
1978 61327 993 
1979 61359 876 
1980 61566 889 
1981 61682 1272
1982 61638 1833 
1983 61423 2258 
1984 61175 2266 
1985 61024 2304 
1986 61066 2228 
1987 61077 2229 
1988 61449 2242 
1989 62063 2038 
1990 63254 1883 
1991 64074 1689 
1992 64865 1808 
1993 65535 2270 
1994 65858 2556 
1995 66156 2565 
1996 66444 2796 
1997 66648 3021 
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allocation rules, 323 

Andrews’ curves, 39 

angle between two vectors, 75 
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ANOVA - simple analysis of variance, 103 


Bayes discriminant rule, 328 
Bernoulli distribution, 143 
Bernoulli distributions, 143 
best line, 221 
binary structure, 303 
Biplots, 356 
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bootstrap sample, 150 
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canonical correlation, 361 

canonical correlation analysis, 361 

canonical correlation coefficient, 363 

canonical correlation variable, 363 

canonical correlation vector, 363 

centering matrix, 93 

central limit theorem (CLT), 143, 145 

centroid, 312 
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classic blue pullovers, 84 

cluster algorithms, 308 

cluster analysis, 301 

Cochran theorem, 163 
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column space, 77, 221 

common factors, 277 

common principal components, 256 
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complete linkage, 311 

computationally intensive techniques, 421 

concentration ellipsoid, 138 
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conditional covariance, 433 


conditional density, 121 
conditional distribution, 157 
conditional expectation, 127, 432, 433 
conditional pdf, 120 
confidence interval, 145 
confusion matrix, 332 
conjoint measurement analysis, 393 
contingency table, 341 
contrast, 195 
convex hull, 423 
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correlation, 86 

    multiple, 160 
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covariance, 82 
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properties, 126 
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duality theorem, 382 


effective dimension reduction directions, 431, 


effective dimension reduction space, 431 
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explained variation, 98 

exploratory projection pursuit, 425 
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F-spread, 16 
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faces, 34 

factor analysis, 275 

factor analysis model, 275 

factor model, 282 

factor score, 291 

factor scores, 291 

factorial axis, 223 

factorial method, 250 

factorial representation, 229, 231 
factorial variable, 223, 230 
factors, 221 

Farthest Neighbor, 311 

Fisher information, 180 

Fisher information matrix, 178, 179 
Fisher’s linear discrimination function, 333 
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flury faces, 35 

fourths, 15 

French food expenditure, 253 
full model, 105 
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maximum likelihood estimator, 174 
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multivariate t-distribution, 168 
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Nearest Neighbor, 311 

non-metric solution, 400 

Nonexistence of a riskless asset, 410 

nonhomogeneous, 94 

nonmetric methods of MDS, 377 

norm of a vector, 74 

normal distribution, 175 

normalized principal components (NPCs), 249 

null space, 77 


order statistics, 15 
orthogonal complement, 78 
orthogonal matrix, 59 
orthonormed, 223 

outliers, 13 

outside bars, 16 


parallel coordinates plots, 42 

parallel profiles, 205 

partitioned covariance matrix, 156 

partitioned matrices, 68 

PAV algorithm, 384, 405 

pool-adjacent violators algorithm, 384, 405 

portfolio analysis, 407 

portfolio choice, 407 

positive definite, 65 

positive definiteness, 67 

positive or negative dependence, 34 

positive semidefinite, 65, 93 

principal axes, 73 

principal component method, 286 

principal components, 237 

principal components analysis (PCA), 233, 432, 435 

principal components in practice, 238 

principal components technique, 238 

principal components transformation, 234, 237 

principal factors, 285 

profile analysis, 205 

profile method, 396 

projection matrix, 77 

projection pursuit, 425 

projection pursuit regression, 428 

projection vector, 431 

proximity between objects, 302 

proximity measure, 302 


quadratic discriminant analysis, 330 
quadratic form, 65 

quadratic forms, 65 

quality of the representations, 252 


randomized discriminant rule, 329 
rank, 58 

reduced model, 105 

rotation, 289 

rotations, 76 

row space, 221 

Russel and Rao (RR), 304 


sampling distributions, 142 
scatterplot matrix, 31 
scatterplots, 30 
separation line, 31 
similarity of objects, 303 
Simple Matching, 304 
single linkage, 311 
single matching, 305 
singular normal distribution, 140 
singular value decomposition (SVD), 64, 228 
sliced inverse regression, 431, 435 
    algorithm, 432 
sliced inverse regression II, 433, 434, 436, 437 
    algorithm, 434 
solution 
    nonmetric, 403 
specific factors, 277 
specific variance, 278 
spectral decompositions, 63 
spherical distribution, 167 
standardized linear combinations (SLC), 234 
statistics, 142 
stimulus, 395 
Student’s t-distribution, 96 
sum of squares, 105 
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symmetric matrix, 59 
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The CAPM, 417 

total variation, 98 
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variance explained by PCs, 247 
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varimax method, 289 
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